# A2_MCDA_5511
Assignment 2 for MCDA 5511

To get the code to run you must

1. Run `uv sync`
2. Select the Python 3.10.16 kernel
3. Run `.venv\Scripts\Activate`
4. Run `.venv\Scripts\python.exe -m ensurepip --upgrade`
5. Run `.venv\Scripts\python.exe -m pip install ipykernel`

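The commands above use Windows paths. On macOS or Linux, the virtual environment's executables live in `.venv/bin` rather than `.venv\Scripts`, so activate with `source .venv/bin/activate` and use `.venv/bin/python` in place of `.venv\Scripts\python.exe`.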

## Question 1

### Sampling / Processing

We originally looked into sampling our data to make it more manageable. However, upon further research we found that the benefits of doing so did not outweigh the downsides: sampling would introduce too much bias into the model, and the dataset was not large enough for sampling to improve performance substantially.

As an example of bias: if we used something like stratified sampling, the only way we found to divide up the data would be by movie genre. This causes an issue because some movies fall into multiple genres, so treating every possible combination as its own group would yield 9033 unique classifications. Since strata must be mutually exclusive, we could not group by single genres, and that many combination-based groups is too fine-grained to sample from meaningfully, so stratified sampling would be of no use to us.
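
To illustrate why genre combinations multiply so quickly, here is a minimal sketch of how the groups could be counted. The sample rows are hypothetical, and the sketch assumes that genres are stored as stringified lists and that genre order does not matter. The source does not say whether the 9033 figure was computed the same way.

```python
import ast
from collections import Counter

# Hypothetical sample rows; assumes genres are stored as strings like "['Action', 'Comedy']"
sample_genres = [
    "['Action', 'Comedy']",
    "['Comedy', 'Action']",
    "['Action']",
    "['Drama', 'Romance']",
    "[]",
]

def genre_combination(raw):
    """Parse a stringified genre list into an order-independent combination."""
    return frozenset(ast.literal_eval(raw))

# Each distinct combination would have to become its own stratum
strata = Counter(genre_combination(raw) for raw in sample_genres)
print(f"Unique genre combinations: {len(strata)}")
for combination, count in strata.items():
    print(sorted(combination), count)
```

Even in this toy sample, five movies already fall into four separate strata, which hints at how the real dataset ends up with thousands of groups.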

There could be some merit in cluster sampling; however, we decided that our dataset was small enough that sampling was not needed.

Instead, we chose to drop some non-relevant features to cut down on processing time. The features we deemed relevant for this model were the title, release date, genre(s), overview, popularity, and revenue.

### Statistics
We ran some calculations to find some basic statistics of our data set. Here is a graph showing the spread of document length:

![document_length_graph](document_length.png)

We found that the average length of an overview was 272.23 characters. This figure covers only the overview field of each entry in the data set; the actual text passed to the model was longer, since we combined all the features into one continuous string. After this processing, each document averaged about 75 words.
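
As a rough illustration of how such averages can be computed, the sketch below uses a few hypothetical overviews and skips missing ones. Whether the original calculation excluded missing overviews is an assumption here.

```python
from statistics import mean

# Hypothetical overviews standing in for the dataset's `overview` column
overviews = [
    "Two plumbers travel through a mysterious pipe.",
    "A detective hunts a masked criminal.",
    None,  # some movies have no overview
]

# Assumption: skip missing overviews so they do not distort the averages
present = [text for text in overviews if text]

avg_chars = mean(len(text) for text in present)
avg_words = mean(len(text.split()) for text in present)
print(f"Average characters: {avg_chars:.2f}")
print(f"Average words: {avg_words:.2f}")
```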

As for vocabulary, we ran two tests: one to find the most common words, and one to find the most common words excluding stopwords:

![top_10_words](top_10.png)

![top_10_no_stopwords](top_10_no_stopwords.png)
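
A word-frequency count like the two above could be sketched as follows. The documents and the tiny stopword set are hypothetical; a real run would likely use a fuller stopword list, and the tokenization the authors used is not stated.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the list used for the charts)
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "his", "her"}

# Hypothetical documents standing in for the movie text
documents = [
    "The hero returns to the city to find his brother.",
    "A young hero and her brother escape the city.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

all_tokens = [token for doc in documents for token in tokenize(doc)]

# Top words overall, then top words once stopwords are filtered out
top_words = Counter(all_tokens).most_common(3)
top_content_words = Counter(t for t in all_tokens if t not in STOPWORDS).most_common(3)

print("Most common words:", top_words)
print("Most common words without stopwords:", top_content_words)
```

Filtering stopwords before counting keeps function words such as "the" from crowding out the content words.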

### Topics

The dataset covers statistics for the top 10000 movies on TMDB, including each movie's title, revenue, overview, popularity, and so on. It's important to note that not all movies have data for every feature, but every movie has at least a title. We found that 78 of the movies did not have overviews, 24 did not have a release date, 2 did not have a popularity rating, and 2 did not have a revenue.

In [107]:
import pandas as pd
from sentence_transformers import SentenceTransformer
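import sklearn.metrics.pairwise  # import the submodule explicitly; a bare `import sklearn` does not guarantee it is loaded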

# Load the dataset
original_df = pd.read_csv('top_10000_popular_movies_tmdb.csv')

# Select the relevant columns
df = original_df[['title', 'release_date', 'genres', 'overview', 'popularity', 'revenue']]

# Create a new row with null values
null_row = pd.DataFrame([{
    'title': 'Null movie',
    'release_date': float('nan'),
    'genres': "[]",
    'overview': float('nan'),
    'popularity': float('nan'),
    'revenue': 0
}])

df = pd.concat([null_row, df], ignore_index=True)

# Finds missing values in the dataset
for feature in df.columns:
    missing_data = df[df[feature].isnull()]
    print(f"Number of missing values in {feature}: {len(missing_data)}")

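# Convert each row to a formatted string and store in a new column.
# A helper function avoids nested same-type quotes and backslashes inside
# f-string expressions, which are a SyntaxError before Python 3.12.
def format_row(row):
    release = 'an unknown date' if pd.isnull(row['release_date']) else row['release_date']
    genres = 'movie with unknown genre(s)' if row['genres'] == '[]' else f"{row['genres']} movie"
    plot = "movie's plot is unknown." if pd.isnull(row['overview']) else f"that is about {row['overview']}"
    popularity = 'an unknown amount' if pd.isnull(row['popularity']) else row['popularity']
    # Treat both missing (NaN) and zero revenue as unknown
    revenue_unknown = pd.isnull(row['revenue']) or row['revenue'] == 0
    revenue = 'an unknown amount' if revenue_unknown else f"{row['revenue']} USD"
    return (
        f"{row['title']}, released on {release}, is a {genres} with a plot of: {plot} "
        f"It has a popularity score of {popularity}, assigned to the movie by TMDB based on user engagement. "
        f"It generated {revenue} in revenue."
    )

df['formatted_string'] = df.apply(format_row, axis=1)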


Number of missing values in title: 0
Number of missing values in release_date: 24
Number of missing values in genres: 0
Number of missing values in overview: 78
Number of missing values in popularity: 2
Number of missing values in revenue: 2


In [108]:
# Load the model
model_BAAI = SentenceTransformer('BAAI/bge-small-en')

# Encode the formatted strings
# Encode all strings in one batched call instead of one call per row
embeddings = pd.Series(list(model_BAAI.encode(df["formatted_string"].astype(str).tolist())), index=df.index)

# Add the embeddings to the DataFrame
df["embedding"] = embeddings
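
The next cell ranks movies by the cosine similarity between each stored embedding and the query embedding. As a minimal, dependency-free sketch of that idea, the example below uses hypothetical 3-dimensional vectors (real embeddings have many more dimensions). The helper names `cosine_similarity` and `top_k` are illustrative and not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return (index, similarity) pairs for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(vec, query_vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy embeddings and query
doc_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]

for index, score in top_k(query_vec, doc_vecs, k=2):
    print(index, round(score, 3))
```

Returning indices rather than the text itself avoids collisions when two documents share the same string.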


In [153]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_flan = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Define the query
query = "Give me the plot of the super mario bros movie."
#query = "Which movie was released first? the Dark Knight or Crater"
#query = "how much revenue did the movie the dark knight make?"

query_embedding = model_BAAI.encode(query)

# Compute cosine similarity for each formatted string
similarity_dict = {}
for index, row in df.iterrows():
    cosine_sim = sklearn.metrics.pairwise.cosine_similarity([row["embedding"]], [query_embedding])[0][0]
    similarity_dict[row["formatted_string"]] = cosine_sim

# Sort the similarities in descending order and get the top k formatted strings
k = 2
top_k_strings = sorted(similarity_dict.items(), key=lambda item: item[1], reverse=True)[:k]

# Print the top k formatted strings with their cosine similarities
print(f"Top {k} formatted strings with the highest cosine similarities:")
for formatted_string, similarity in top_k_strings:
    print(f"Movie: {formatted_string}")
    print(f"Cosine similarity: {similarity}\n")

# Use the top k formatted strings (text only, without the similarity scores) as context in the instruction prompt
context = " ".join(formatted_string for formatted_string, _ in top_k_strings)
instruction_prompt = f"Based on the following information, {query}: {context} Don't make up any new information, just use the information given."

inputs = tokenizer(instruction_prompt, return_tensors="pt")
outputs = model_flan.generate(**inputs)
print(f"{query}:")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Top 2 formatted strings with the highest cosine similarities:
Movie: The Super Mario Bros. Movie, released on 2023-04-05, is a ['Animation', 'Family', 'Adventure', 'Fantasy', 'Comedy'] movie with a plot of: that is about While working underground to fix a water main, Brooklyn plumbers—and brothers—Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi. It has a popularity score of 3394.458, assigned to the movie by TMDB based on user engagement. It generated 1308766975.0 USD in revenue.
Cosine similarity: 0.870617151260376

Movie: Super Mario Bros., released on 1993-05-28, is a ['Adventure', 'Fantasy', 'Comedy', 'Family', 'Science Fiction'] movie with a plot of: that is about Mario and Luigi, plumbers from Brooklyn, find themselves in an alternate universe where evolved dinosaurs live in hi-tech squalor. They're the only hope to save our universe from invasion by the d