## 🎯 Project Goal Recap

You’ll build a pipeline that:

- Ingests plot summaries (IMDb, Wikipedia, etc.)
- Compares the crimes they portray with real-world FBI crime statistics
- Classifies:
    1. **What crime type is portrayed** in the media
    2. **Whether that crime type is over- or underrepresented** compared to real-world data
- Deploys an app where users input a movie/show and get a “realism” analysis
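
One way to make "over- or underrepresented" concrete is to compare each crime type's share of on-screen portrayals with its share of reported offenses. The sketch below illustrates that idea only; the function name `representation_ratio` and all counts are hypothetical and do not come from IMDb or FBI data.

```python
def representation_ratio(media_counts, real_counts):
    """Return {crime_type: media_share / real_share}.

    A ratio above 1 means the crime type takes up a larger share of
    portrayals than of reported offenses; below 1 means a smaller share.
    Crime types with a real-world count of zero are skipped.
    """
    media_total = sum(media_counts.values())
    real_total = sum(real_counts.values())
    if media_total == 0 or real_total == 0:
        raise ValueError("both inputs need at least one counted item")

    ratios = {}
    for crime_type, real_n in real_counts.items():
        if real_n == 0:
            continue
        media_share = media_counts.get(crime_type, 0) / media_total
        real_share = real_n / real_total
        ratios[crime_type] = media_share / real_share
    return ratios


# Hypothetical counts, for illustration only.
media = {"murder": 60, "theft": 30, "fraud": 10}
real = {"murder": 1, "theft": 80, "fraud": 19}

for crime_type, ratio in sorted(representation_ratio(media, real).items()):
    print(f"{crime_type}: {ratio:.2f}")
```

A share-based ratio is used here, rather than raw counts, so that the number of titles collected does not affect the result. Other definitions, such as per-year ratios, would fit the same interface.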

| Step | Tool/Library |
| --- | --- |
| Data Collection | IMDb Datasets, TMDb API, Wikipedia API, FBI UCR/NIBRS data |
| Preprocessing | Python, Pandas, spaCy or NLTK |
| NLP Feature Extraction | TF-IDF, BERT via `transformers`, Sentence-BERT |
| ML Model | Scikit-learn (Logistic Regression, SVM), optionally XGBoost or LightGBM |
| Clustering | Scikit-learn (KMeans), Gensim (LDA) |
| Evaluation | Confusion Matrix, ROC-AUC, F1-Score |
| Visualization | Seaborn, Matplotlib, Plotly |
| Deployment | Streamlit (best for fast prototyping), Flask (for production apps) |
| Optional | HuggingFace, Docker, Weights & Biases (tracking) |

### **Phase 1: Data Collection & Cleaning**

**1. Collect Media Data**

- Scrape or download movie/show plot summaries:
    - Use IMDb datasets or TMDb API (`python-tmdb`)
    - Supplement with Wikipedia plot summaries via Wikipedia API
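
As a rough sketch of the Wikipedia step, the code below builds a MediaWiki `action=query` URL that requests a plain-text extract and then pulls the plot section out of the response. It assumes the extract marks sections with headings like `== Plot ==`; some articles use "Premise" or "Synopsis" instead, which this sketch does not handle. The helper names and the sample payload are hypothetical, and no network request is made.

```python
import re
from urllib.parse import urlencode

WIKI_API = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a MediaWiki query URL asking for the article as plain text."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "redirects": 1,
        "titles": title,
    }
    return f"{WIKI_API}?{urlencode(params)}"


def plot_from_response(payload):
    """Return the text under a '== Plot ==' heading, or None if absent."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        text = page.get("extract", "")
        # Capture everything up to the next top-level heading or end of text.
        match = re.search(
            r"^== Plot ==\s*\n(.*?)(?=^== |\Z)", text, flags=re.M | re.S
        )
        if match:
            return match.group(1).strip()
    return None


# Hand-written stand-in for a real API response, for illustration only.
fake_payload = {
    "query": {"pages": {"1": {"title": "Example Film", "extract": (
        "Example Film is a 1999 crime drama.\n\n\n"
        "== Plot ==\nA crew robs a bank and is caught.\n\n\n"
        "== Cast ==\nNobody in particular."
    )}}}
}

print(build_extract_url("Heat (1995 film)"))
print(plot_from_response(fake_payload))
```

A real fetch would pass the URL to an HTTP client such as `urllib.request.urlopen` and decode the JSON body before calling `plot_from_response`.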

**2. Collect Crime Data**

- Use FBI UCR or NIBRS datasets (available as CSV downloads)
    - Clean the data and group offenses into buckets, e.g., violent (murder, assault) and property (theft, fraud)
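
The bucketing step can be sketched as a lookup table. The offense labels below are illustrative assumptions; real UCR/NIBRS files use their own offense names and codes, so the mapping would need to be rebuilt against the actual columns. `CRIME_BUCKETS` and `to_bucket` are hypothetical names.

```python
# Assumed mapping from offense labels to coarse buckets (illustrative only).
CRIME_BUCKETS = {
    "murder": "violent",
    "assault": "violent",
    "robbery": "violent",
    "theft": "property",
    "burglary": "property",
    "fraud": "property",
}


def to_bucket(offense, default="other"):
    """Map a raw offense label to its bucket, ignoring case and padding."""
    return CRIME_BUCKETS.get(offense.strip().lower(), default)


print([to_bucket(o) for o in ["Murder", " theft ", "Arson"]])
# → ['violent', 'property', 'other']
```

With the data in a pandas DataFrame, the same function could be applied to a column, for example `df["bucket"] = df["offense"].map(to_bucket)`, where `offense` is an assumed column name.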

**3. Build a Combined Dataset**

- For each movie/show, label:
    - Primary crime shown (taken from metadata if available, otherwise labeled manually)
    - Year released
- For FBI data:
    - Aggregate crime frequency by type and year
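
The aggregation and join described above might look like the following pandas sketch. The two small DataFrames and their column names (`year`, `crime_type`, `incidents`, `title`) are hypothetical stand-ins for the cleaned FBI records and the labeled titles.

```python
import pandas as pd

# Hypothetical rows standing in for cleaned FBI records and labeled titles.
fbi = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020],
    "crime_type": ["violent", "property", "property", "violent"],
    "incidents": [10, 40, 50, 20],
})
media = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2019, 2019, 2020],
    "crime_type": ["violent", "violent", "violent"],
})

# Real-world frequency per crime type and year.
fbi_counts = fbi.groupby(["year", "crime_type"], as_index=False)["incidents"].sum()

# Number of titles portraying each crime type per release year.
media_counts = (
    media.groupby(["year", "crime_type"]).size().reset_index(name="titles")
)

# An outer join keeps crime types that appear on only one side.
combined = fbi_counts.merge(
    media_counts, on=["year", "crime_type"], how="outer"
).fillna({"titles": 0, "incidents": 0})

print(combined)
```

The outer join is a deliberate choice here: a crime type that is common in FBI data but never portrayed should still show up, with a title count of zero, because that absence is itself a sign of underrepresentation.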

In [1]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("movies_genres.csv", delimiter='\t')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117352 entries, 0 to 117351
Data columns (total 30 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   title        117352 non-null  object
 1   plot         117352 non-null  object
 2   Action       117352 non-null  int64 
 3   Adult        117352 non-null  int64 
 4   Adventure    117352 non-null  int64 
 5   Animation    117352 non-null  int64 
 6   Biography    117352 non-null  int64 
 7   Comedy       117352 non-null  int64 
 8   Crime        117352 non-null  int64 
 9   Documentary  117352 non-null  int64 
 10  Drama        117352 non-null  int64 
 11  Family       117352 non-null  int64 
 12  Fantasy      117352 non-null  int64 
 13  Game-Show    117352 non-null  int64 
 14  History      117352 non-null  int64 
 15  Horror       117352 non-null  int64 
 16  Lifestyle    117352 non-null  int64 
 17  Music        117352 non-null  int64 
 18  Musical      117352 non-null  int64 
 19  My