# Globe Tossing - A Bayesian Learning Exercise

## Introduction
Suppose you have a globe representing our planet, the Earth. This version of the world is small enough to hold in your hands. You are curious how much of the surface is covered in water. You adopt the following strategy: you toss the globe up in the air. When you catch it, you record whether the surface under your right index finger is water (W) or land (L). Then you toss the globe again and repeat the procedure. This strategy generates a sequence of samples from the globe. The first nine samples could look like the following sequence: $\mathcal{D} = (W, L, W, W, W, L, W, L, W)$




## Objective:
In this task, you will implement a Bayesian updating function to estimate the proportion of water on Earth based on a sequence of tosses that you will generate. You will also create a plot to visualize how the Bayesian model learns and refines its estimate as more data is collected.

In [None]:
import numpy as np
import matplotlib.pyplot as plt


## Instructions:

1. Initialize your prior belief about the proportion $p$ of water on Earth. Represent this belief as a uniform distribution on the interval [0, 1]. You can use the `numpy` library to create an array of 1000 evenly spaced values between 0 and 1, representing potential proportions of water.

In [None]:
# Step 1: Initialize the prior belief

proportions =  # sol
prior =  # sol
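One possible way to fill in this step is sketched below. It assumes the prior is stored as a discrete distribution over the grid, normalized so that its values sum to 1; other normalizations (for example, a constant density of 1) would work equally well as long as later steps renormalize.

```python
import numpy as np

# Grid of 1000 candidate values for p, evenly spaced on [0, 1].
proportions = np.linspace(0, 1, 1000)

# Uniform prior: every candidate is equally plausible.
# Dividing by the grid size makes the values sum to 1 (an assumption of this sketch).
prior = np.ones_like(proportions) / proportions.size
```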


2. Create a function called `likelihood` that takes two arguments: the proportion of water and an observed outcome (either 'W' or 'L'). The function should return the likelihood of observing the outcome given the proportion of water. For example, if the proportion of water is 0.6 and the outcome is 'W', the likelihood is 0.6.

In [None]:
# Step 2: Create the likelihood function

def likelihood(proportion, outcome):
    # sol
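A minimal sketch of such a function follows. It treats each toss as a Bernoulli trial, so 'W' has probability `proportion` and 'L' has probability `1 - proportion`. Raising an error for any other outcome is an assumption of this sketch, not a requirement of the exercise.

```python
import numpy as np

def likelihood(proportion, outcome):
    """Probability of a single outcome given the proportion of water."""
    # Accept a scalar or an array of candidate proportions.
    proportion = np.asarray(proportion, dtype=float)
    if outcome == "W":
        return proportion
    if outcome == "L":
        return 1.0 - proportion
    raise ValueError(f"outcome must be 'W' or 'L', got {outcome!r}")

print(likelihood(0.6, "W"))  # 0.6, as in the example above
```

Because the function also accepts arrays, it can be evaluated on the whole grid of candidate proportions in one call.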


3. Create a function called `bayesian_update` that takes three arguments: the prior distribution, the array of candidate proportions, and the observed outcome. The function should perform the Bayesian update by calculating the likelihood of the observed outcome for each proportion in the array and multiplying it by the corresponding prior value. Normalize the resulting distribution so that it sums to 1. This normalized distribution is your new posterior distribution.

In [None]:
# Step 3: Create the bayesian_update function

def bayesian_update(prior, proportions, outcome):
    # sol
    return posterior
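The update described in this step can be sketched as follows. The block repeats a compact `likelihood` and the grid from the earlier steps only so that it runs on its own; treat it as one possible solution rather than the reference one.

```python
import numpy as np

def likelihood(proportion, outcome):
    # 'W' has probability p, 'L' has probability 1 - p.
    return proportion if outcome == "W" else 1.0 - proportion

def bayesian_update(prior, proportions, outcome):
    # Posterior is proportional to likelihood times prior, evaluated on the grid.
    unnormalized = likelihood(proportions, outcome) * prior
    # Normalize so that the posterior sums to 1.
    posterior = unnormalized / unnormalized.sum()
    return posterior

proportions = np.linspace(0, 1, 1000)
prior = np.ones_like(proportions) / proportions.size
posterior = bayesian_update(prior, proportions, "W")
```

After a single 'W', the posterior rises linearly with the proportion and assigns zero plausibility to a globe with no water at all.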


4. Simulate a series of globe tosses with a known true proportion of water. You can use the `numpy.random.choice` function to generate random outcomes ('W' or 'L') based on the true proportion. For this exercise, assume the true proportion of water is 0.71. Generate 9 toss outcomes.


In [None]:
# Step 4: Simulate globe tosses

true_proportion = 0.71
toss_outcomes =  # sol
print(toss_outcomes)
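A possible way to draw the outcomes is sketched below, using the `p` argument of `numpy.random.choice` to weight 'W' and 'L'. The variable `n_tosses` is a name introduced here for illustration. No seed is set, so the printed sequence changes from run to run.

```python
import numpy as np

true_proportion = 0.71
n_tosses = 9  # hypothetical helper name for the number of tosses

# Draw 'W' with probability 0.71 and 'L' with probability 0.29.
toss_outcomes = np.random.choice(
    ["W", "L"], size=n_tosses, p=[true_proportion, 1 - true_proportion]
)
print(toss_outcomes)
```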


5. Update your prior belief iteratively using the `bayesian_update` function and the observed outcomes. Perform the updates after every toss and store the resulting posterior distributions.

In [None]:
# Step 5: Update prior belief iteratively and store posterior distributions

posteriors = []
toss_sequences = []
# sol
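The iterative update can be sketched as a loop in which each posterior becomes the prior for the next toss. To keep the result reproducible, this sketch uses the fixed sequence from the introduction instead of simulated tosses, and it stores the tosses seen so far as a string in `toss_sequences` (one possible reading of what that list is for).

```python
import numpy as np

def bayesian_update(prior, proportions, outcome):
    # Likelihood of the outcome for every candidate proportion.
    like = proportions if outcome == "W" else 1.0 - proportions
    unnormalized = like * prior
    return unnormalized / unnormalized.sum()

proportions = np.linspace(0, 1, 1000)
prior = np.ones_like(proportions) / proportions.size
toss_outcomes = list("WLWWWLWLW")  # the sequence D from the introduction

posteriors = []
toss_sequences = []
current = prior
for i, outcome in enumerate(toss_outcomes, start=1):
    # Yesterday's posterior is today's prior.
    current = bayesian_update(current, proportions, outcome)
    posteriors.append(current)
    toss_sequences.append("".join(toss_outcomes[:i]))

# Most plausible proportion after all nine tosses.
print(toss_sequences[-1], proportions[np.argmax(posteriors[-1])])
```

With six 'W' and three 'L', the most plausible grid value ends up close to 6/9.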


6. Create a plot to visualize how the Bayesian model learns and refines its estimate as more data is collected. On the x-axis, show the proportion of water, and on the y-axis, show the plausibility (the posterior probability) of each proportion. Plot the prior distribution and the posterior distributions after each toss.

In [None]:
# Step 6: Plot the learning process in a grid of plots
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.flatten()

# sol

plt.tight_layout()
plt.show()
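One way to fill the 3×3 grid is sketched below: each panel shows the distribution before the toss (dashed) and after it (solid), so the first panel contains the uniform prior. The fixed toss sequence, the labels, and the line styles are choices made for this sketch only.

```python
import numpy as np
import matplotlib.pyplot as plt

def bayesian_update(prior, proportions, outcome):
    like = proportions if outcome == "W" else 1.0 - proportions
    unnormalized = like * prior
    return unnormalized / unnormalized.sum()

proportions = np.linspace(0, 1, 1000)
prior = np.ones_like(proportions) / proportions.size
toss_outcomes = "WLWWWLWLW"  # the sequence D from the introduction

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.flatten()

previous = prior
for i, (ax, outcome) in enumerate(zip(axes, toss_outcomes), start=1):
    posterior = bayesian_update(previous, proportions, outcome)
    # Dashed line: belief before this toss; solid line: belief after it.
    ax.plot(proportions, previous, linestyle="--", label="before toss")
    ax.plot(proportions, posterior, label="after toss")
    ax.set_title(f"n = {i}: {toss_outcomes[:i]}")
    ax.set_xlabel("proportion of water")
    ax.set_ylabel("plausibility")
    ax.legend()
    previous = posterior

plt.tight_layout()
plt.show()
```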


# Sentiment Analysis - A Naive Bayes Exercise

## Theoretical Understanding
Consider a document classification problem, where you are given a dataset of text documents with their corresponding categories. You are asked to classify new documents into one of the given categories using the Naive Bayes classifier.

1. Explain the Naive Bayes assumption in the context of document classification. Why is this assumption considered "naive"?

1 - Example solution: In the context of document classification, the Naive Bayes assumption is that the features (words in this case) are conditionally independent given the class. In other words, the presence of one word does not influence the presence of another word in a document, given the document's category. This assumption is considered "naive" because, in reality, words in a document often have dependencies and are not completely independent.

2. Write the formula for the Naive Bayes classifier in terms of the likelihood, prior, and posterior probabilities, and explain each term's meaning.

2 - Example Solution: The Naive Bayes classifier is based on Bayes' theorem:

$$
P(C_k|X) = \frac{P(X|C_k) \cdot P(C_k)}{P(X)}
$$

where:

- $P(C_k|X)$ is the posterior probability of class $C_k$ given the features (words) $X$
- $P(X|C_k)$ is the likelihood of features (words) $X$ given class $C_k$
- $P(C_k)$ is the prior probability of class $C_k$
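- $P(X)$ is the evidence, i.e. the marginal probability of the features (words) $X$ over all classes. It is the same for every class, so it does not affect which class receives the highest posterior.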
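
To make the role of the naive assumption explicit, the likelihood can be written as a product over the individual words. The notation $x_1, \dots, x_n$ for the words of a document and $\hat{C}$ for the predicted class is introduced here for illustration:

$$
P(X|C_k) = \prod_{i=1}^{n} P(x_i|C_k), \qquad \hat{C} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i|C_k)
$$

The denominator $P(X)$ is omitted from the decision rule because it takes the same value for every class.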

3. Discuss how the Naive Bayes classifier handles cases where some words in a new document are not present in the training data. Explain the concept of "smoothing" and how it helps in this situation.

3 - Example Solution: When a word in a new document never occurred with a given class in the training data, its estimated likelihood for that class is zero. Because the word likelihoods are multiplied, a single zero makes the entire likelihood for that class zero, no matter how strongly the other words support it. This can lead to incorrect classifications. Smoothing is a technique to handle this situation. One common method is Laplace smoothing, which adds a small constant α (usually 1) to each word's frequency count in every class and adjusts the denominator accordingly. This ensures that no word has a zero likelihood and allows the Naive Bayes classifier to make reasonable predictions even when it encounters words it has not seen with a class before.
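
As a sketch of the usual form of Laplace smoothing, let $\text{count}(w, C_k)$ be the number of times word $w$ occurs in documents of class $C_k$, $N_k$ the total number of words in that class, and $|V|$ the vocabulary size (symbols introduced here for illustration):

$$
P(w|C_k) = \frac{\text{count}(w, C_k) + \alpha}{N_k + \alpha |V|}
$$

For example, with $\alpha = 1$, $N_k = 10$ and $|V| = 5$, a word that never occurred in class $C_k$ gets $P(w|C_k) = \frac{0 + 1}{10 + 5} = \frac{1}{15} \approx 0.067$ instead of zero.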

## Implementation

Implement a Naive Bayes classifier in Python to classify documents into one of two categories: "positive" or "negative." You are given a training dataset of movie reviews and their corresponding labels (either "positive" or "negative").

Your dataset consists of (input,output) pairs with the following shape:

(document, label) = (movie_review, positive/negative)

### Preprocessing the data
Write a function `preprocess(document)` that takes a single document as input and returns a list of words after tokenization, lowercase conversion, and removal of punctuation and stop words.

In [None]:
# Preprocess the document

stop_words = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he",
              "in", "is", "it", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with"}
symbols = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


def preprocess(document):
    # sol
    return words
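A possible implementation is sketched below. It replaces every punctuation symbol with a space before splitting, which is one choice among several: "top-notch" becomes two tokens and "can't" becomes "can" and "t". The helper name `_to_space` is introduced for this sketch.

```python
stop_words = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he",
              "in", "is", "it", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with"}
symbols = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Translation table that maps every punctuation symbol to a space.
_to_space = str.maketrans(symbols, " " * len(symbols))

def preprocess(document):
    # Lowercase, strip punctuation, then tokenize on whitespace.
    cleaned = document.lower().translate(_to_space)
    # Drop stop words.
    words = [word for word in cleaned.split() if word not in stop_words]
    return words

print(preprocess("The acting was superb, and the plot was GREAT!"))
# ['acting', 'superb', 'plot', 'great']
```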


### Implementing a Naive Bayes classifier
Write a function `train(training_data)` that takes the training dataset as input and returns a dictionary containing the prior probabilities for each class and the likelihoods for each word given each class. Apply Laplace smoothing with $\alpha = 1$.

In [None]:
# Train the Naive Bayes model

def train(training_data):
    model = {}
    # sol
    return model
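The training step can be sketched as counting documents per class and words per class, then turning the counts into smoothed probabilities. Several details here are assumptions of the sketch: the model layout with the keys `"priors"` and `"likelihoods"`, the optional `alpha` parameter, and the simplified `preprocess` stand-in that only lowercases and splits on whitespace.

```python
from collections import Counter, defaultdict

def preprocess(document):
    # Simplified stand-in for the exercise's preprocess(): lowercase and split.
    return document.lower().split()

def train(training_data, alpha=1):
    class_doc_counts = Counter()        # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for document, label in training_data:
        words = preprocess(document)
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)

    total_docs = sum(class_doc_counts.values())
    model = {"priors": {}, "likelihoods": {}}
    for label, n_docs in class_doc_counts.items():
        model["priors"][label] = n_docs / total_docs
        # Laplace smoothing: add alpha to every count, alpha * |V| to the total.
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model["likelihoods"][label] = {
            word: (word_counts[label][word] + alpha) / denominator
            for word in vocabulary
        }
    return model

toy_data = [("good good movie", "positive"), ("bad movie", "negative")]
model = train(toy_data)
print(model["priors"])
print(model["likelihoods"]["positive"]["good"])  # (2 + 1) / (3 + 3) = 0.5
```

Storing a likelihood for every vocabulary word in every class means that a word seen only in negative reviews still has a small non-zero probability under the positive class.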


Write a function `classify(document, model)` that takes a document and the model (dictionary) returned by the `train()` function as input and returns the predicted class ("positive" or "negative") for the given document.

In [None]:
# Classify a document using the Naive Bayes model

def classify(document, model):
    # sol
    return predicted_class
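A minimal sketch of the classification step follows. It sums log probabilities instead of multiplying probabilities, which avoids numerical underflow for long documents, and it skips words that are not in the model at all. The model layout (`"priors"` and `"likelihoods"` keys), the hand-written `toy_model`, and the simplified `preprocess` stand-in are assumptions made so that the block runs on its own.

```python
import math

def preprocess(document):
    # Simplified stand-in for the exercise's preprocess(): lowercase and split.
    return document.lower().split()

def classify(document, model):
    words = preprocess(document)
    scores = {}
    for label, prior in model["priors"].items():
        word_probs = model["likelihoods"][label]
        # Log posterior up to a constant: log prior plus log likelihoods.
        score = math.log(prior)
        for word in words:
            if word in word_probs:  # ignore words outside the vocabulary
                score += math.log(word_probs[word])
        scores[label] = score
    predicted_class = max(scores, key=scores.get)
    return predicted_class

# Hand-written toy model, for illustration only.
toy_model = {
    "priors": {"positive": 0.5, "negative": 0.5},
    "likelihoods": {
        "positive": {"good": 0.5, "movie": 1 / 3, "bad": 1 / 6},
        "negative": {"good": 0.2, "movie": 0.4, "bad": 0.4},
    },
}
print(classify("good movie", toy_model))  # positive
print(classify("bad movie", toy_model))   # negative
```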


Test your implementation using the small dataset of movie reviews provided below, and provide the classification results for a few example reviews.

In [None]:
# Test the Naive Bayes model
training_data = [
    ("I absolutely loved this movie! The acting was superb.", "positive"),
    ("The movie had a great plot and fantastic performances.", "positive"),
    ("This is one of the best movies I have ever seen.", "positive"),
    ("The cinematography was amazing, and the story was captivating.", "positive"),
    ("This movie is an all-time favorite of mine.", "positive"),
    ("The acting and direction were top-notch.", "positive"),
    ("This film was a wonderful blend of humor and drama.", "positive"),
    ("The story was original and the characters were well-developed.", "positive"),
    ("A must-watch movie for everyone!", "positive"),
    ("The visual effects were stunning, and the soundtrack was perfect.", "positive"),
    ("Incredible performances by the entire cast.", "positive"),
    ("A heartwarming and inspiring story.", "positive"),
    ("An unforgettable cinematic experience.", "positive"),
    ("The movie had me on the edge of my seat!", "positive"),
    ("A thrilling and emotional rollercoaster.", "positive"),
    ("I can't stop thinking about how good this movie was.", "positive"),
    ("A timeless classic that I'll watch again and again.", "positive"),
    ("The movie was charming and full of surprises.", "positive"),
    ("The plot twists kept me guessing until the very end.", "positive"),
    ("An excellent adaptation of the book.", "positive"),
    ("I found this movie to be boring and uninspired.", "negative"),
    ("The plot was predictable, and the acting was mediocre.", "negative"),
    ("This movie was a complete waste of time.", "negative"),
    ("The characters were unlikable, and the story was dull.", "negative"),
    ("I can't believe I wasted two hours of my life watching this.", "negative"),
    ("The special effects were laughable, and the acting was wooden.", "negative"),
    ("I have never been so disappointed by a movie in my life.", "negative"),
    ("The dialogue was cheesy, and the storyline was confusing.", "negative"),
    ("The movie had no redeeming qualities whatsoever.", "negative"),
    ("The acting was terrible, and the plot made no sense.", "negative"),
    ("This film was a mess from start to finish.", "negative"),
    ("I would not recommend this movie to anyone.", "negative"),
    ("I walked out of the theater halfway through the movie.", "negative"),
    ("A lackluster performance by the lead actor.", "negative"),
    ("I couldn't wait for this movie to end.", "negative"),
    ("The movie tried too hard to be funny, but it just wasn't.", "negative"),
    ("I was so disappointed with the ending.", "negative"),
    ("The movie was extremely slow-paced and uneventful.", "negative"),
    ("Poorly written and poorly executed.", "negative"),
    ("The movie was forgettable and not worth watching.", "negative"),
]

model = train(training_data)

test_reviews = [
    ("This movie was a masterpiece with a fantastic plot and great acting!", "positive"),
    ("I didn't enjoy this movie at all. It was boring and predictable.", "negative"),
]

# sol
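The evaluation loop can be sketched as a small helper that compares predictions with the expected labels. The names `evaluate` and `keyword_predict` are hypothetical; `keyword_predict` is a trivial stand-in used only so the block runs on its own. In the notebook you would pass something like `lambda review: classify(review, model)` instead.

```python
def evaluate(test_reviews, predict):
    """Print each prediction and return the fraction classified correctly."""
    correct = 0
    for review, expected in test_reviews:
        predicted = predict(review)
        correct += predicted == expected
        print(f"{predicted:>8} (expected {expected}): {review}")
    return correct / len(test_reviews)

def keyword_predict(review):
    # Stand-in predictor for illustration; not a Naive Bayes model.
    return "negative" if "boring" in review.lower() else "positive"

test_reviews = [
    ("This movie was a masterpiece with a fantastic plot and great acting!", "positive"),
    ("I didn't enjoy this movie at all. It was boring and predictable.", "negative"),
]
accuracy = evaluate(test_reviews, keyword_predict)
print(f"accuracy: {accuracy:.2f}")
```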


Comment on the performance of your model. What are its potential pitfalls, and what could you implement to improve it? Keep your answer short.

#SOL