# Part 2: Bayesian Probability Analysis

This notebook applies Bayes' theorem to the IMDB movie reviews dataset to estimate how strongly individual keywords indicate positive sentiment.

**Keywords Selected:**
- **Positive sentiment**: awesome, great, incredible  
- **Negative sentiment**: garbage, worst, bad

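**Probabilities computed:**
- Prior: P(Positive), the fraction of all reviews labeled positive (the same for every keyword)
- Likelihood: P(keyword|Positive), the fraction of positive reviews that contain the keyword
- Marginal: P(keyword), the fraction of all reviews that contain the keyword
- Posterior: P(Positive|keyword), the probability that a review containing the keyword is positive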
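
The four quantities above are linked by Bayes' theorem: posterior = likelihood × prior / marginal. The sketch below is a minimal illustration of that relationship, not the notebook's own code. The helper name `bayes_posterior` and all counts are hypothetical and made up for the example; they are not taken from the IMDB data.

```python
def bayes_posterior(prior: float, likelihood: float, marginal: float) -> float:
    """Return P(Positive | keyword) = likelihood * prior / marginal."""
    if marginal <= 0:
        raise ValueError("marginal must be positive; the posterior is undefined otherwise")
    return likelihood * prior / marginal


# Made-up counts for illustration only (not from the IMDB dataset).
n_reviews = 1000             # all reviews
n_positive = 500             # reviews labeled positive
n_keyword_in_positive = 30   # positive reviews containing the keyword
n_keyword_total = 40         # all reviews containing the keyword

prior = n_positive / n_reviews                   # P(Positive)
likelihood = n_keyword_in_positive / n_positive  # P(keyword | Positive)
marginal = n_keyword_total / n_reviews           # P(keyword)

posterior = bayes_posterior(prior, likelihood, marginal)

# The same number falls out of a direct count: positive reviews with the
# keyword divided by all reviews with the keyword.
direct_ratio = n_keyword_in_positive / n_keyword_total

print(f"Posterior via Bayes' theorem: {posterior:.4f}")
print(f"Direct count ratio:           {direct_ratio:.4f}")
```

When every probability is estimated from counts on the same dataset, the Bayes computation and the direct count ratio agree, which makes the direct ratio a convenient sanity check.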

In [15]:
import pandas as pd

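# Machine-specific path to the extracted CSV; change it to wherever your copy of "IMDB Dataset.csv" lives.
df = pd.read_csv(r"C:\Users\ituma\AppData\Local\Temp\b516848d-1e61-42de-bba3-2af4afd7b4be_imdb-dataset-of-50k-movie-reviews.zip.4be\IMDB Dataset.csv")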

df['review'] = df['review'].astype(str)

print(df.head())
print(df.columns)
print(df["sentiment"].value_counts())
print("Total rows:", len(df))

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Index(['review', 'sentiment'], dtype='object')
sentiment
positive    25000
negative    25000
Name: count, dtype: int64
Total rows: 50000


In [16]:
total_reviews = len(df)
positive_reviews = len(df[df["sentiment"] == "positive"])

prior_positive = positive_reviews / total_reviews

print("Total reviews:", total_reviews)
print("Positive reviews:", positive_reviews)
print("P(Positive):", prior_positive)

Total reviews: 50000
Positive reviews: 25000
P(Positive): 0.5


In [17]:
keywords = ["awesome", "great", "incredible", "garbage", "worst", "bad"]

In [30]:
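results = []

# Lowercase once so matching is case-insensitive, and select the positive
# reviews once instead of on every loop iteration.
reviews_lower = df["review"].str.lower()
positive_lower = reviews_lower[df["sentiment"] == "positive"]

for keyword in keywords:

    # Likelihood P(keyword | Positive): share of positive reviews containing the keyword.
    # Note: `in` is a substring test, so "bad" also matches "badly".
    count_keyword_positive = sum(keyword in review for review in positive_lower)
    likelihood = count_keyword_positive / len(positive_lower)

    # Marginal P(keyword): share of all reviews containing the keyword.
    count_keyword_total = sum(keyword in review for review in reviews_lower)
    marginal = count_keyword_total / total_reviews

    # Posterior P(Positive | keyword) via Bayes' theorem.
    if marginal > 0:
        posterior = (likelihood * prior_positive) / marginal
    else:
        # The posterior is undefined if the keyword never occurs; report NaN rather than 0.
        posterior = float("nan")

    results.append((keyword, likelihood, marginal, posterior))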

print(f"{'Keyword':<12} | {'Prior':<8} | {'Likelihood':<10} | {'Marginal':<10} | {'Posterior':<10}")
print("-" * 60)
for r in results:
    print(f"{r[0]:<12} | {prior_positive:<8.4f} | {r[1]:<10.4f} | {r[2]:<10.4f} | {r[3]:<10.4f}")

Keyword      | Prior    | Likelihood | Marginal   | Posterior 
------------------------------------------------------------
awesome      | 0.5000   | 0.0254     | 0.0177     | 0.7180    
great        | 0.5000   | 0.3712     | 0.2761     | 0.6722    
incredible   | 0.5000   | 0.0292     | 0.0197     | 0.7421    
garbage      | 0.5000   | 0.0050     | 0.0174     | 0.1447    
worst        | 0.5000   | 0.0164     | 0.0887     | 0.0927    
bad          | 0.5000   | 0.1303     | 0.2541     | 0.2565    
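
The counts behind this table come from substring matching (`keyword in review.lower()`), so a keyword such as "bad" is also counted inside longer words like "badly". One possible alternative, shown here as a sketch rather than the method used above, is whole-word matching with a regular expression. The helper name `contains_word` and the sample sentences are hypothetical and made up for illustration; they are not from the IMDB dataset.

```python
import re


def contains_word(text: str, keyword: str) -> bool:
    """Return True if keyword occurs in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up sentences for illustration only (not from the IMDB dataset).
samples = [
    "This was a bad movie.",
    "The script was badly written.",
    "Not bad at all!",
]

# Substring matching counts "badly" as a hit; whole-word matching does not.
substring_hits = sum("bad" in s.lower() for s in samples)
whole_word_hits = sum(contains_word(s, "bad") for s in samples)

print("Substring matches: ", substring_hits)   # 3
print("Whole-word matches:", whole_word_hits)  # 2
```

Swapping this in would change the likelihood, marginal, and posterior values, so the table above would need to be recomputed.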
