# Playing with concepts

This activity is meant for you to play around with the concepts and Python code, explore libraries, run experiments, and in general check theory against practice and reality. You can use any tool you want, including LLMs. Just try to come away with some findings for most of the concepts we have seen. Use libraries such as NLTK and spaCy. Research how to implement the theory we have covered, such as n-grams, Naive Bayes, language models...

Yes, it is very similar to exercise S05_3, so if you have already started that one, you can build on the code you already have.

In [1]:
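import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab' instead).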

# Sample text
text = "This is a simple example to generate n-grams"

# Tokenize the text
tokens = word_tokenize(text)

# Generate bigrams (2-grams)
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)

# Generate trigrams (3-grams)
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)


Bigrams: [('This', 'is'), ('is', 'a'), ('a', 'simple'), ('simple', 'example'), ('example', 'to'), ('to', 'generate'), ('generate', 'n-grams')]
Trigrams: [('This', 'is', 'a'), ('is', 'a', 'simple'), ('a', 'simple', 'example'), ('simple', 'example', 'to'), ('example', 'to', 'generate'), ('to', 'generate', 'n-grams')]
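
The n-gram tuples above are the raw material for an n-gram language model. As a minimal sketch of that next step, the block below estimates bigram probabilities P(word | prev) from counts, with optional add-k smoothing. The toy corpus, the `<s>`/`</s>` padding markers and the helper name `bigram_prob` are assumptions made for illustration; they do not come from NLTK or from the exercise.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only); sentences are pre-tokenized.
corpus = [
    ["i", "like", "nlp"],
    ["i", "like", "python"],
    ["i", "love", "nlp"],
]

# Pad each sentence with start/end markers so sentence boundaries are modeled too.
padded = [["<s>"] + sent + ["</s>"] for sent in corpus]

# Count histories (every token that has a successor) and bigrams.
history_counts = Counter(tok for sent in padded for tok in sent[:-1])
bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))

# Tokens that can appear in the predicted position.
vocab = {tok for sent in padded for tok in sent[1:]}


def bigram_prob(prev, word, k=0.0):
    """Estimate P(word | prev); k=0 is the maximum-likelihood estimate."""
    denominator = history_counts[prev] + k * len(vocab)
    if denominator == 0:
        # Unseen history and no smoothing: the MLE is undefined, report 0.0.
        return 0.0
    return (bigram_counts[(prev, word)] + k) / denominator


print(bigram_prob("i", "like"))               # count("i like") / count("i") = 2/3
print(bigram_prob("love", "python"))          # unseen bigram, MLE gives 0.0
print(bigram_prob("love", "python", k=1.0))   # add-one smoothing: 1/7
```

With `k=0` any unseen bigram gets probability zero, which is why some form of smoothing is usually applied before such a model scores new sentences.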


In [2]:
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier

# Load movie review data (positive and negative reviews)
nltk.download('movie_reviews')

# Bag-of-words feature extractor: every word that occurs in a review
# becomes a boolean "present" feature
def extract_features(words):
    return {word: True for word in words}

# Prepare the dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

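# Split the dataset into training and testing sets.
# Note: documents is ordered by category (all 'neg' reviews, then all 'pos'),
# so this slice trains on 1000 'neg' + 500 'pos' and tests on 'pos' reviews only.
# Shuffle documents before slicing to get a representative estimate.
train_set = [(extract_features(d), c) for (d, c) in documents[:1500]]
test_set = [(extract_features(d), c) for (d, c) in documents[1500:]]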

# Train a Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

# Evaluate on the test set
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Classifier accuracy: {accuracy * 100:.2f}%")


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\iagoc\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


Classifier accuracy: 94.20%
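
One finding worth checking before trusting this figure: the `documents` list is built category by category, so slicing it at index 1500 does not give a mixed test set. The sketch below reproduces only the label order, assuming the corpus holds 1000 reviews per category listed as `neg` then `pos`; the `labels` list is a stand-in, not the real data, and the seed value is arbitrary.

```python
import random
from collections import Counter

# Stand-in for the labels of `documents`: 1000 'neg' followed by 1000 'pos'
# (assumed corpus layout, mirroring the order of the list comprehension).
labels = ["neg"] * 1000 + ["pos"] * 1000

# Plain slicing, as in the cell above.
sliced_train = Counter(labels[:1500])
sliced_test = Counter(labels[1500:])
print("Sliced train:", dict(sliced_train))
print("Sliced test: ", dict(sliced_test))

# Shuffle a copy with a fixed seed, then slice at the same index.
shuffled = labels[:]
random.Random(42).shuffle(shuffled)
shuffled_test = Counter(shuffled[1500:])
print("Shuffled test:", dict(shuffled_test))
```

In the real cell the equivalent change would be to shuffle `documents` (for example with `random.Random(42).shuffle(documents)`) before building `train_set` and `test_set`. The accuracy measured that way will likely differ from the one printed above, since the test set would then contain both classes.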

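To connect the classifier with the theory, here is a minimal from-scratch sketch of a multinomial Naive Bayes with add-alpha (Laplace) smoothing. It is not how NLTK's `NaiveBayesClassifier` is implemented, since that class works on the boolean feature dictionaries built above; the toy training data and the helper names `log_posterior` and `classify` are hypothetical and only serve the illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label) pairs.
train = [
    (["great", "fun", "film"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "film"], "neg"),
    (["boring", "slow", "plot"], "neg"),
]

# Class frequencies (for the priors) and per-class word counts (for the likelihoods).
label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {tok for tokens, _ in train for tok in tokens}


def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) under the naive independence assumption."""
    total = sum(word_counts[label].values())
    score = math.log(label_counts[label] / len(train))  # log prior
    for tok in tokens:
        if tok in vocab:  # words never seen in training are skipped
            smoothed = (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
    return score


def classify(tokens):
    """Return the label with the highest log posterior."""
    return max(label_counts, key=lambda label: log_posterior(tokens, label))


print(classify(["great", "film"]))           # pos
print(classify(["slow", "boring", "film"]))  # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen word from forcing a class probability to zero.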