
In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
newsgroups = fetch_20newsgroups(subset='all')
X, y = newsgroups.data, newsgroups.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Initialize and train the Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)

# Predict the test set results
y_pred = nb_classifier.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=newsgroups.target_names)

print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(report)


In [20]:
X_train_tfidf

<15076x148987 sparse matrix of type '<class 'numpy.float64'>'
	with 1637216 stored elements in Compressed Sparse Row format>

In [19]:
# Ensure the vectorizer and classifier are already defined as per the previous code

# New text input
new_text = ['''The primary issue is that Revanna's whereabouts are currently unknown.
The sex tapes
Revanna, the grandson of former Prime Minister HD Deve Gowda, is the central figure in a sexual abuse case. He departed from India on a diplomatic passport, reportedly to Germany, just one day after the Lok Sabha elections took place in his constituency.
Numerous women have accused the MP of violating their modesty and recording the acts on camera.
On April 27, at the request of the State Commission for Women, the Karnataka government formed a Special Investigation Team (SIT) to investigate the alleged video clips of sexual abuse and exploitation involving the Hassan MP.
Although the probe was ordered a day after the second phase of polling in the state, pen drives containing thousands of explicit video clips featuring multiple women had been circulating in Hassan and nearby areas well before the election.
Politics amid polls
The case has become a political flashpoint during the ongoing Lok Sabha elections, with the Congress criticising the BJP for aligning with a "tainted" MP and his party. The opposition parties have also targeted Prime Minister Narendra Modi for campaigning for him.
The BJP, on the other hand, has accused the Congress of allowing the accused to escape from the state.
Karnataka chief minister Siddaramaiah has once again written to PM Modi, requesting the cancellation of the MP's diplomatic passport.
Blue corner notice
The Interpol has already issued a 'Blue Corner Notice' to gather information on Prajwal Revanna's whereabouts, following a request from the SIT through the Central Bureau of Investigation (CBI).
A Special Court for Elected Representatives has also issued an arrest warrant against Prajwal Revanna, based on an application filed by the SIT.
Where is Prajwal?
Kumaraswamy, while addressing media persons on Wednesday, stated that Prajwal was not in contact with his father and MLA H D Revanna or anyone else.
"Where will I go search for him? If I go abroad, they will say I have gone to save Prajwal... he is not in contact with anyone... with the advice of some advocates, all these things have happened. In case, I had come to know about him leaving for abroad on April 27, I would have stopped him," he said.
"Prajwal had sought a week to appear before SIT, but it was denied and another rape case was filed against him. With all this he might be afraid (to come back)," he added.
Earlier this month, MEA spokesperson Randhir Jaiswal confirmed that Revanna traveled to Germany on a diplomatic passport without seeking political clearance for the trip.
"No political clearance was either sought from or issued by MEA in respect of the travel of the said MP to Germany," Jaiswal had stated.''']

# Transform the new text data using the same TF-IDF vectorizer
new_text_tfidf = vectorizer.transform(new_text)

# Predict the category of the new text data
predicted_category = nb_classifier.predict(new_text_tfidf)

# Get the category name
predicted_category_name = newsgroups.target_names[predicted_category[0]]

print(f'The predicted category for the text "{new_text[0]}" is: {predicted_category_name}')


The predicted category for the text "The primary issue is that Revanna's whereabouts are currently unknown.
The sex tapes
Revanna, the grandson of former Prime Minister HD Deve Gowda, is the central figure in a sexual abuse case. He departed from India on a diplomatic passport, reportedly to Germany, just one day after the Lok Sabha elections took place in his constituency.
Numerous women have accused the MP of violating their modesty and recording the acts on camera.
On April 27, at the request of the State Commission for Women, the Karnataka government formed a Special Investigation Team (SIT) to investigate the alleged video clips of sexual abuse and exploitation involving the Hassan MP.
Although the probe was ordered a day after the second phase of polling in the state, pen drives containing thousands of explicit video clips featuring multiple women had been circulating in Hassan and nearby areas well before the election.
Politics amid polls
The case has become a political flashpoi
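
`predict` returns only the single most likely label. To see how the probability mass is spread across categories, the classifier's `predict_proba` can be ranked instead. The sketch below is illustrative only: it assumes a hypothetical four-document corpus with made-up labels (`train_texts`, `train_labels`) in place of 20 Newsgroups so that it runs offline, and it bundles the vectorizer and classifier with `make_pipeline` so the two fitted objects cannot drift apart between cells.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels (stand-ins for the 20 Newsgroups data)
train_texts = [
    "the election results were announced by the government",
    "the minister addressed parliament about the new policy",
    "the team won the match with a late goal",
    "the striker scored twice in the final game",
]
train_labels = ["politics", "politics", "sports", "sports"]

# Bundling both steps keeps the fitted vectorizer and classifier together
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Class probabilities for one new document
new_doc = ["the government announced a new election policy"]
probabilities = model.predict_proba(new_doc)[0]

# Rank categories from most to least likely
ranking = np.argsort(probabilities)[::-1]
for index in ranking:
    print(f"{model.classes_[index]}: {probabilities[index]:.3f}")
```

With the real model, the same ranking could be applied to `nb_classifier.predict_proba(new_text_tfidf)[0]` and `newsgroups.target_names`.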

In [22]:
import numpy as np

# Convert the sparse matrix to a dense matrix
# (caution: a dense float64 array of shape 15076 x 148987 needs roughly 18 GB of RAM)
X_train_tfidf_dense = X_train_tfidf.toarray()

# Print the shape of the dense matrix
print(f'Shape of X_train_tfidf_dense: {X_train_tfidf_dense.shape}')

# Print the first 5 rows and first 10 columns of the dense matrix
print(X_train_tfidf_dense[:5, :10])

# Optionally, print the feature names corresponding to the columns
feature_names = vectorizer.get_feature_names_out()
print('Feature names (first 10):', feature_names[:10])


Shape of X_train_tfidf_dense: (15076, 148987)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Feature names (first 10): ['00' '000' '0000' '00000' '00000000' '0000000004' '0000000005'
 '00000000b' '00000001' '00000001b']
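
Densifying the whole training matrix is rarely necessary just to look at a corner of it. The sketch below first works out, from the shape and stored-element count reported for `X_train_tfidf`, how large a dense copy would be, and then shows slicing a sparse matrix before calling `toarray()`. It uses a randomly generated SciPy matrix (`demo_matrix`, a hypothetical stand-in, not the real TF-IDF data) so that it runs on its own.

```python
from scipy import sparse

# Shape and stored-element count reported for X_train_tfidf in this notebook
n_rows, n_cols, n_stored = 15076, 148987, 1637216

# Memory a dense float64 copy would need (8 bytes per element)
dense_bytes = n_rows * n_cols * 8
print(f'Dense size: {dense_bytes / 1e9:.1f} GB')

# Fraction of entries that are actually non-zero
density = n_stored / (n_rows * n_cols)
print(f'Density: {density:.4%}')

# Stand-in sparse matrix (assumption: random data, not the real TF-IDF matrix)
demo_matrix = sparse.random(1000, 5000, density=0.001, format='csr', random_state=0)

# Slice first, then densify only the 5x10 window that will be printed
window = demo_matrix[:5, :10].toarray()
print(window.shape)
```

Applied to the notebook's data, the equivalent call would be `X_train_tfidf[:5, :10].toarray()`.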


### Understanding TF-IDF Features

TF-IDF stands for **Term Frequency-Inverse Document Frequency**. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. This helps to adjust for the fact that some words appear more frequently in general (like "the", "is", "in") and should not be considered as important as context-specific words.

### Components of TF-IDF

1. **Term Frequency (TF)**:
   - The term frequency is simply the count of a term in a document. However, to prevent bias towards longer documents, it is usually normalized by the document length (i.e., the total number of terms in the document).

   \[
   \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
   \]

2. **Inverse Document Frequency (IDF)**:
   - The inverse document frequency is a measure of how much information the word provides, i.e., if it’s common or rare across all documents.

   \[
   \text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
   \]

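   Words that appear in many documents have a low IDF, while words that appear in fewer documents have a high IDF. The base of the logarithm is a convention and only rescales the scores; the worked example below uses base 10.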

3. **TF-IDF Score**:
   - The TF-IDF score is calculated as the product of TF and IDF.

   \[
   \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
   \]

### How TF-IDF Works

- **Normalization**: Each term's frequency is normalized to prevent bias towards longer documents.
- **Weighting**: Terms that are common across many documents have their importance reduced, while terms that are rare across the corpus are given higher importance.
- **Sparse Representation**: Typically, TF-IDF matrices are sparse because most terms do not appear in most documents.

### Advantages of TF-IDF

- **Relevance**: TF-IDF helps to filter out common words and highlight the important words in the context of the document.
- **Efficiency**: It is computationally efficient and simple to implement.
- **Effectiveness**: It often provides better results compared to raw term frequencies, especially in scenarios like information retrieval and text classification.

### Example Calculation

Let's assume we have the following documents:
1. "the cat sat on the mat"
2. "the dog sat on the log"
3. "the cat chased the dog"

**Step-by-Step Calculation**:

1. **Term Frequency (TF)**:
   - For the word "cat" in the first document:
     \[
     \text{TF}(\text{"cat"}, d1) = \frac{1}{6}
     \]
   - For the word "the" in the first document:
     \[
     \text{TF}(\text{"the"}, d1) = \frac{2}{6}
     \]

2. **Document Frequency (DF)**:
   - "cat" appears in 2 documents.
   - "the" appears in all 3 documents.

3. **Inverse Document Frequency (IDF)**:
   - For the word "cat":
     \[
     \text{IDF}(\text{"cat"}) = \log \left( \frac{3}{2} \right) \approx 0.176
     \]
   - For the word "the":
     \[
     \text{IDF}(\text{"the"}) = \log \left( \frac{3}{3} \right) = 0
     \]

4. **TF-IDF Score**:
   - For "cat" in the first document:
     \[
     \text{TF-IDF}(\text{"cat"}, d1) = \frac{1}{6} \times 0.176 \approx 0.029
     \]
   - For "the" in the first document:
     \[
     \text{TF-IDF}(\text{"the"}, d1) = \frac{2}{6} \times 0 = 0
     \]

### Visualizing TF-IDF Features

The scores can be arranged in a matrix with one row per document and one column per term:

| Document \ Term | cat  | dog  | sat  | mat  | log  | chased | the  | on   |
|-----------------|------|------|------|------|------|--------|------|------|
| Document 1      | 0.029| 0    | ...  | ...  | ...  | ...    | 0    | ...  |
| Document 2      | 0    | ...  | ...  | ...  | ...  | ...    | 0    | ...  |
| Document 3      | ...  | ...  | ...  | ...  | ...  | ...    | 0    | ...  |

The cells marked `...` are left uncomputed here; each one follows from the same TF × IDF calculation applied to that term and document.
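
The hand calculation can be checked with a few lines of plain Python. This is a minimal sketch of the formulas from this section, assuming whitespace tokenization and a base-10 logarithm (the base that yields the 0.176 figure used earlier); the helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Whitespace tokenization (an assumption; real tokenizers are more involved)
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})

def tf(term, tokens):
    # Count of the term divided by the document length
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # Base-10 log of (number of documents / documents containing the term)
    df = sum(term in tokens for tokens in corpus)
    return math.log10(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# One row per document, one column per vocabulary term
table = [[tf_idf(term, tokens, tokenized) for term in vocabulary]
         for tokens in tokenized]

print(vocabulary)
for row in table:
    print([round(value, 3) for value in row])
```

The entry for "cat" in the first row rounds to 0.029 and every entry in the "the" column is 0, matching the worked example.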

### Practical Usage in Code

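In Python, `TfidfVectorizer` from `sklearn.feature_extraction.text` computes TF-IDF scores directly, as in the earlier cells. Its defaults differ from the textbook formulas above: it uses the raw term count for TF, a smoothed natural-log IDF, `ln((1 + n) / (1 + df(t))) + 1`, and then scales each document row to unit L2 norm. Its scores therefore do not match the hand calculation; for example, "the" does not receive a score of 0.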

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog"
]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names
feature_names = vectorizer.get_feature_names_out()

# Convert to dense matrix to inspect
dense_matrix = tfidf_matrix.toarray()

# Print the TF-IDF matrix
print("TF-IDF Matrix:")
print(dense_matrix)

# Print feature names
print("Feature Names:")
print(feature_names)
```

### Output Interpretation

- **TF-IDF Matrix**: Each row corresponds to a document and each column corresponds to a term. The values are the TF-IDF scores.
- **Feature Names**: These are the terms corresponding to each column of the matrix.

TF-IDF weighting gives a simple, sparse feature representation for tasks such as text classification, clustering, and information retrieval, and it often works better than raw term counts.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog"
]

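# Initialize the TF-IDF Vectorizer under a separate name so the vectorizer
# fitted on 20 Newsgroups in the first cell is not overwritten
toy_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = toy_vectorizer.fit_transform(documents)

# Get feature names
feature_names = toy_vectorizer.get_feature_names_out()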

# Convert to dense matrix to inspect
dense_matrix = tfidf_matrix.toarray()

# Print the TF-IDF matrix
print("TF-IDF Matrix:")
print(dense_matrix)

# Print feature names
print("Feature Names:")
print(feature_names)


TF-IDF Matrix:
[[0.37420726 0.         0.         0.         0.49203758 0.37420726
  0.37420726 0.58121064]
 [0.         0.         0.37420726 0.49203758 0.         0.37420726
  0.37420726 0.58121064]
 [0.40352536 0.53058735 0.40352536 0.         0.         0.
  0.         0.62674687]]
Feature Names:
['cat' 'chased' 'dog' 'log' 'mat' 'on' 'sat' 'the']
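
The matrix above can be reproduced step by step, which makes the default weighting of `TfidfVectorizer` explicit. The sketch assumes default parameters (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw counts as TF, an IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The names `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Raw term counts: with default settings, TF is the unnormalized count
counts = CountVectorizer().fit_transform(documents).toarray()

# Document frequency: number of documents containing each term
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed IDF with natural log, plus 1 so no term is zeroed out
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's own result
reference = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual, reference))
```

Because of the `+ 1` term and the row normalization, "the" ends up with the largest weight in each row (it occurs twice per document) rather than the 0 given by the unsmoothed formula.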
