# Part 2: Bayesian Probability Analysis

This notebook applies Bayes' theorem to keyword-based sentiment analysis on the IMDB movie reviews dataset (50,000 reviews, evenly split between positive and negative).

**Keywords Selected:**
- **Positive sentiment**: awesome, great, incredible  
- **Negative sentiment**: garbage, worst, bad

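**Probabilities computed for each keyword:**
- Prior: P(Positive), the fraction of all reviews labeled positive (the same for every keyword)
- Likelihood: P(keyword|Positive), the fraction of positive reviews that contain the keyword
- Marginal: P(keyword), the fraction of all reviews that contain the keyword
- Posterior: P(Positive|keyword), the probability that a review containing the keyword is positive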
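
For reference, these quantities are related by Bayes' theorem. Writing $k$ for the event that a review contains the keyword (a shorthand introduced here, not used elsewhere in the notebook), the posterior and the marginal can be written as:

```latex
P(\text{Positive} \mid k)
  = \frac{P(k \mid \text{Positive}) \, P(\text{Positive})}{P(k)},
\qquad
P(k)
  = P(k \mid \text{Positive}) \, P(\text{Positive})
  + P(k \mid \text{Negative}) \, P(\text{Negative})
```

The second identity is the law of total probability. The notebook estimates P(keyword) directly by counting over all reviews, which should give the same value.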

In [5]:
import pandas as pd

df = pd.read_csv(r"C:\Users\jeanm\ML\Formative_3\IMDB Dataset.csv\IMDB Dataset.csv")

print(df.head())
print(df.columns)
print(df["sentiment"].value_counts())
print("Total rows:", len(df))

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Index(['review', 'sentiment'], dtype='str')
sentiment
positive    25000
negative    25000
Name: count, dtype: int64
Total rows: 50000


In [6]:
total_reviews = len(df)
positive_reviews = len(df[df["sentiment"] == "positive"])

prior_positive = positive_reviews / total_reviews

print("Total reviews:", total_reviews)
print("Positive reviews:", positive_reviews)
print("P(Positive):", prior_positive)

Total reviews: 50000
Positive reviews: 25000
P(Positive): 0.5


In [7]:
keywords = ["awesome", "great", "incredible", "garbage", "worst", "bad"]

In [8]:
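results = []

# Lowercase every review once, rather than once per keyword.
reviews_lower = df["review"].str.lower()
positive_reviews_lower = reviews_lower[df["sentiment"] == "positive"]

for keyword in keywords:

    # Likelihood P(keyword | Positive): share of positive reviews containing the keyword.
    # Note: `in` is a substring test, so "bad" also matches "badly".
    count_keyword_positive = sum(
        keyword in review
        for review in positive_reviews_lower
    )

    likelihood = count_keyword_positive / len(positive_reviews_lower)

    # Marginal P(keyword): share of all reviews containing the keyword.
    count_keyword_total = sum(
        keyword in review
        for review in reviews_lower
    )

    marginal = count_keyword_total / total_reviews

    # Posterior P(Positive | keyword) via Bayes' theorem.
    # If the keyword never occurs, the posterior is undefined; 0 is a placeholder.
    if marginal > 0:
        posterior = (likelihood * prior_positive) / marginal
    else:
        posterior = 0

    results.append((keyword, likelihood, marginal, posterior))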
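
The loop above counts a keyword with Python's `in` operator, which is a substring test. This means "bad" is also counted inside words such as "badly". The sketch below contrasts that with whole-word matching using the standard-library `re` module. The reviews in `toy_reviews` and the helper names `count_substring` and `count_whole_word` are made up for illustration and are not part of the notebook. Whether whole-word matching suits the analysis better is a judgment call.

```python
import re

# Hypothetical toy reviews, invented for illustration; not taken from the IMDB data.
toy_reviews = [
    "Not bad at all, a great cast.",
    "He badly needed a better script.",
    "Sinbad was the greatest adventure of my childhood.",
    "Simply bad.",
]


def count_substring(keyword, reviews):
    """Count reviews that contain the keyword anywhere, including inside other words."""
    return sum(keyword in review.lower() for review in reviews)


def count_whole_word(keyword, reviews):
    """Count reviews that contain the keyword as a standalone word."""
    # \b marks a word boundary; re.escape guards against regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return sum(bool(pattern.search(review.lower())) for review in reviews)


for kw in ("bad", "great"):
    print(kw, count_substring(kw, toy_reviews), count_whole_word(kw, toy_reviews))
```

On the real data, swapping in whole-word matching would likely change the likelihood, marginal, and posterior for short keywords such as "bad" and "great", so the printed results would need to be regenerated.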

In [9]:
print("\nBayesian Probability Results:\n")

for keyword, likelihood, marginal, posterior in results:
    print(f"Keyword: {keyword}")
    print(f"P(Positive): {prior_positive:.4f}")
    print(f"P({keyword} | Positive): {likelihood:.4f}")
    print(f"P({keyword}): {marginal:.4f}")
    print(f"P(Positive | {keyword}): {posterior:.4f}")
    print("-" * 40)


Bayesian Probability Results:

Keyword: awesome
P(Positive): 0.5000
P(awesome | Positive): 0.0254
P(awesome): 0.0177
P(Positive | awesome): 0.7180
----------------------------------------
Keyword: great
P(Positive): 0.5000
P(great | Positive): 0.3712
P(great): 0.2761
P(Positive | great): 0.6722
----------------------------------------
Keyword: incredible
P(Positive): 0.5000
P(incredible | Positive): 0.0292
P(incredible): 0.0197
P(Positive | incredible): 0.7421
----------------------------------------
Keyword: garbage
P(Positive): 0.5000
P(garbage | Positive): 0.0050
P(garbage): 0.0174
P(Positive | garbage): 0.1447
----------------------------------------
Keyword: worst
P(Positive): 0.5000
P(worst | Positive): 0.0164
P(worst): 0.0887
P(Positive | worst): 0.0927
----------------------------------------
Keyword: bad
P(Positive): 0.5000
P(bad | Positive): 0.1303
P(bad): 0.2541
P(Positive | bad): 0.2565
----------------------------------------
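
As a sanity check, the posteriors printed above can be recomputed from the rounded likelihood and marginal values with Bayes' theorem. The sketch below copies the four-decimal figures from the output. The names `reported`, `PRIOR_POSITIVE`, and `bayes_posterior` are introduced here for illustration only.

```python
# Rounded values copied from the printed results: (likelihood, marginal, posterior).
reported = {
    "awesome": (0.0254, 0.0177, 0.7180),
    "great": (0.3712, 0.2761, 0.6722),
    "incredible": (0.0292, 0.0197, 0.7421),
    "garbage": (0.0050, 0.0174, 0.1447),
    "worst": (0.0164, 0.0887, 0.0927),
    "bad": (0.1303, 0.2541, 0.2565),
}

PRIOR_POSITIVE = 0.5  # P(Positive) from the balanced dataset


def bayes_posterior(likelihood, prior, marginal):
    """Return P(Positive | keyword) = P(keyword | Positive) * P(Positive) / P(keyword)."""
    return likelihood * prior / marginal


differences = {}
for keyword, (likelihood, marginal, posterior) in reported.items():
    recomputed = bayes_posterior(likelihood, PRIOR_POSITIVE, marginal)
    differences[keyword] = abs(recomputed - posterior)
    print(f"{keyword:<10} reported={posterior:.4f} recomputed={recomputed:.4f}")
```

The recomputed values differ slightly from the reported ones because the inputs here are already rounded to four decimals, whereas the notebook computes the posterior from unrounded counts.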
