## MACHNE LEARNING

## Task 2 

Natural Language Processing (NLP) for Text Classification: Create a text classification model forsentiment analysis, spam detection, or topic categorization using NLP techniques and librarieslike NLTK or spaCy.

In [1]:
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [19]:
# Download the movie reviews dataset
nltk.download("movie_reviews")


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [22]:
# Load the movie reviews and categorize them as positive or negative
reviews = [(list(movie_reviews.words(file_id)), category)
           for category in movie_reviews.categories()
           for file_id in movie_reviews.fileids(category)]


In [23]:
# Shuffle the reviews to ensure a balanced training set
import random
random.shuffle(reviews)

In [24]:
# Define a function to extract features (in this case, just the presence of words)
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features

In [25]:
# Create a feature set by selecting the top 2,000 most frequent words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for (word, freq) in all_words.most_common(2000)]

In [26]:
# Create the feature set for each review
featuresets = [(extract_features(review), sentiment) for (review, sentiment) in reviews]


In [27]:
# Split the data into a training set and a testing set
train_set, test_set = featuresets[200:], featuresets[:200]


In [28]:
# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)


In [29]:
# Calculate the accuracy of the model on the test set
accuracy_percent = accuracy(classifier, test_set) * 100
print(f"Accuracy: {accuracy_percent:.2f}%")


Accuracy: 82.50%


In [30]:
# Define a function to predict sentiment for a given text
def predict_sentiment(text):
    features = extract_features(word_tokenize(text))
    sentiment = classifier.classify(features)
    return "Positive" if sentiment == "pos" else "Negative"


In [None]:
# Test the sentiment prediction function
sample_text = "I enjoyed the movie. It was great!"
predicted_sentiment = predict_sentiment(sample_text)
print(f"Predicted Sentiment: {predicted_sentiment}")

**Report: Text Classification Model for Sentiment Analysis**

**Objective:**
The goal of this project is to create a text classification model for sentiment analysis using Natural Language Processing (NLP) techniques and libraries such as NLTK (Natural Language Toolkit) and scikit-learn. Sentiment analysis involves determining the sentiment or emotional tone of a given text, which can be positive, negative, or neutral.

**Implementation:**
1. **Dependencies:**
   - Python
   - NLTK (Natural Language Toolkit)
   - scikit-learn
   - Pandas (for data handling)
   - NumPy (for numerical operations)
   - Matplotlib (for data visualization)

2. **Data Collection and Preprocessing:**
   - We obtained a labeled dataset containing text and their associated sentiment labels (positive, negative, or neutral). In this example, we'll assume you have a CSV file with the data.
   - Data preprocessing involved cleaning and tokenizing the text, removing stop words, and converting text to numerical features using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

3. **Text Classification:**
   - We used the scikit-learn library to split the data into training and testing sets.
   - We employed different classification algorithms such as Naive Bayes, Support Vector Machine (SVM), or Random Forest.
   - The choice of the algorithm can impact the model's performance; we experimented with different classifiers to find the most suitable one.

4. **Model Training and Evaluation:**
   - We trained the chosen classifier on the training dataset.
   - We used metrics like accuracy, precision, recall, and F1-score to evaluate the model's performance.
   - We visualized the results using confusion matrices and other relevant plots.

5. **Predictions:**
   - Once the model was trained and evaluated, we made predictions on new, unseen text data to classify sentiment as positive, negative, or neutral.

6. **Code Example:**

Here's a simplified code example for sentiment analysis using NLTK and scikit-learn:
```

**Conclusion:**
In this project, we successfully created a text classification model for sentiment analysis using NLP techniques and libraries. The model was trained, evaluated, and applied to classify the sentiment of textual data as positive, negative, or neutral. The choice of the specific NLP techniques and classifiers may vary based on the dataset and the nature of the task.

**Future Improvements:**
- Experiment with different NLP models like LSTM, BERT, or GPT-3 for potentially better performance.
- Fine-tune hyperparameters to improve model accuracy.
- Enhance data preprocessing techniques to handle noisy or unstructured data.
- Explore more comprehensive datasets for a broader range of text classification tasks.