# Genre Classification of TV Shows

**Submission deadline: Friday 26 May, 11:55pm**

**Assessment weight: 25% of the total unit assessment.**

*Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. The submission time for all uploaded assessments is **11:55 pm**. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for [Special Consideration](https://students.mq.edu.au/study/assessment-exams/special-consideration).*

In this assignment you will complete tasks for an end-to-end genre classification application. You will train and test your models on the TVmaze data set.

TVmaze is a free online television information database that provides users with detailed information about TV shows, their episodes, and their schedules. The website was launched in 2005 and has since grown to become one of the most comprehensive TV databases available.

### Genre

To reduce opportunities for copying and cheating, you will be given a mostly-unique genre to work with. Add it to the next cell. Email Abid (or ask him when you see him) with your preference; if too many people have chosen that genre, he might ask you to pick again.

Possibilities are:

- Drama

- Comedy

- Romance

- Crime

- Action

- Adventure

- Anime

- Mystery

- History

- Children

- Thriller

- Fantasy

- Science-Fiction

- Family

- Food

- Music

- Travel

- Sports

- Nature

In [None]:
assigned_genre = 

To make sure everyone has unique data (even if you share a genre), several questions will ask you
to initialize a random number generator with `random_state_key`. Pick some number that is likely
to be unique to you (e.g. the digits from your student number).

In [None]:
# Replace 12345 with the digits from your student number, or some other number that is likely to be unique.
random_state_key = 12345

### Data

You will find a SQLite database (called `tvmaze.sqlite`) on iLearn. This is the data you will work from. Copy it into the same directory as this Jupyter notebook.

The following cell should create a connection for you.

In [None]:
import sqlite3

# Connect to the SQLite database
connection = sqlite3.connect("tvmaze.sqlite")

### Character Data

In a few places, you will be asked to run queries on the names of characters.

The following cell creates a dataframe called `characters_df` using this query:

`select tvmaze_id, tvmaze_character_id, name from tvmaze_casting join tvmaze_characters using (tvmaze_character_id);`

In [None]:
import pandas as pd

# Define the SQL query
sql_query = """
select tvmaze_id, tvmaze_character_id, name
from tvmaze_casting join tvmaze_characters
using (tvmaze_character_id);
"""

# Read the data into a pandas dataframe
characters_df = pd.read_sql_query(sql_query, connection)

If you aren't familiar with pandas, and just want to use raw `numpy`, you can use the `characters` array
created in the next cell.

In [None]:
characters = characters_df.to_numpy()
characters.shape, characters.dtype

And to make it a little easier, here are the character names extracted as a list.

In [None]:
character_names = list(characters[:,2])
character_names

### Show data

You might want to start with 200 shows so that your program runs faster, and then later on replace it with
500, or 1000 if you need more data.

In [None]:
data_size = 1000

The following cells create:

- A dataframe called `show_df` (if you are familiar with pandas)

- A numpy array called `shownames` (the names of the shows to work with)

- A numpy array called `descriptions` (which has the show descriptions)

- A numpy array called `in_genre` (whether this show is in your target genre or not)

- A numpy array called `show_ids` (the TVmaze ID numbers of the shows).

In [None]:
out_of_genre_sql_query = f"""
select tvmaze_id, showname, description, 0.0 as in_genre from tvmaze
where tvmaze_id not in (select tvmaze_id from tvmaze_genre where genre = '{assigned_genre}')
      and description is not null
      and length(description) > 10"""

in_genre_sql_query = f"""select tvmaze_id, showname, description, 1.0 as in_genre from tvmaze
where tvmaze_id in (select tvmaze_id from tvmaze_genre where genre = '{assigned_genre}')
      and description is not null
      and length(description) > 10
"""

out_of_genre_df = pd.read_sql(out_of_genre_sql_query, connection)
in_genre_df = pd.read_sql(in_genre_sql_query, connection)
show_df = pd.concat([out_of_genre_df.sample(data_size, random_state=random_state_key), 
                     in_genre_df.sample(data_size, random_state=random_state_key)])
show_df

In [None]:
show_ids = show_df.tvmaze_id.to_numpy()
descriptions = show_df.description.to_numpy()
in_genre = show_df.in_genre.to_numpy()
shownames = show_df.showname.to_numpy()

## Task 1 (3 marks) - Regular expressions

### 1.1 (1 mark)

We all know that James Bond gets his gadgets from Q. Are there other shows where a character has a 
one-letter name?

Write a regular expression that matches a single upper-case letter, and use it to check against
the characters in `character_names`.

How many shows do you find?
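As a minimal sketch of the matching step only, the cell below uses a hypothetical toy list (`toy_names`) in place of `character_names`. It assumes that "upper-case letter" means an ASCII letter; whether accented or non-Latin capitals should count is left open. `re.fullmatch` succeeds only when the entire name is a single letter.

```python
import re

# Hypothetical stand-in for character_names; the real list comes from the database.
toy_names = ["Q", "M", "James Bond", "Mr. T", "q", "QQ"]

# A single upper-case ASCII letter and nothing else.
single_letter = re.compile(r"[A-Z]")

# fullmatch anchors the pattern at both ends of the string.
one_letter_names = [name for name in toy_names if single_letter.fullmatch(name)]
print(one_letter_names)
```

Note that this counts characters, not shows. To count shows you would also need the `tvmaze_id` column of `characters_df`.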

### 1.2 (1 mark)

Write a regular expression that finds medical doctors. A medical doctor might be "Dr." or "Doctor" or "Dr".

Watch out for:

- JUDr. Augusta (who has a PhD in law)

- MUDr. Sova (who is a doctor)

- The Doctor (a science fiction character, who isn't a medical doctor)

- The Sixth Doctor (the same science fiction character; there are fifteen of them)

Assume that Dr. Death and Dr. Teeth are doctors.
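One way to approach this is to write the tricky cases down as a small test table before trying the pattern on `character_names`. The sketch below does that with one candidate pattern. The pattern is an assumption, not the only reasonable answer: it requires the title to be followed by a capitalised name, and it uses a negative lookbehind so that the "Dr" inside "JUDr." is ignored. The names "Doctor Watson", "Dr Watson" and "Drew" are extra hypothetical cases.

```python
import re

# Candidate pattern (an assumption): a doctor title, then whitespace,
# then a capitalised word. "MUDr." is listed explicitly; a bare "Dr"
# must not be preceded by a letter, which rules out "JUDr.".
doctor_pattern = re.compile(
    r"(?:\bMUDr\.|(?<![A-Za-z])Dr\.?|\bDoctor)\s+[A-Z]"
)

# Expected outcomes: cases from the task text plus three hypothetical names.
expected = {
    "Dr. Death": True,
    "Dr. Teeth": True,
    "Dr Watson": True,
    "Doctor Watson": True,
    "MUDr. Sova": True,
    "JUDr. Augusta": False,
    "The Doctor": False,
    "The Sixth Doctor": False,
    "Drew": False,
}

results = {name: bool(doctor_pattern.search(name)) for name in expected}
mismatches = [name for name in expected if results[name] != expected[name]]
print(mismatches)
```

An empty `mismatches` list means the pattern agrees with the table. The real data may well contain cases that this table does not cover.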

### 1.3 (1 mark)

Write a regular expression to find character names written in the Cyrillic alphabet.
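A minimal sketch, assuming that the basic Unicode Cyrillic block (U+0400 to U+04FF) is enough for this data set. It shows two possible readings of the question on a hypothetical toy list: names that contain any Cyrillic letter, and names made up only of Cyrillic letters and whitespace.

```python
import re

# U+0400 to U+04FF is the basic Unicode Cyrillic block.
any_cyrillic = re.compile(r"[\u0400-\u04FF]")
only_cyrillic = re.compile(r"[\u0400-\u04FF\s]+")

# Hypothetical names; the real ones are in character_names.
toy_names = ["Иван", "Ivan", "Анна Каренина", "Dr. Живаго"]

# Reading 1: the name contains at least one Cyrillic letter.
contains_cyrillic = [n for n in toy_names if any_cyrillic.search(n)]

# Reading 2: the whole name is Cyrillic letters and whitespace.
entirely_cyrillic = [n for n in toy_names if only_cyrillic.fullmatch(n)]

print(contains_cyrillic)
print(entirely_cyrillic)
```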

## Task 2 (5 marks) - lexico-semantic preparation for a classifier

For this task only, consider the output of `nltk.word_tokenize()` to be 
what we mean by a "word". Be case insensitive (i.e. lowercase all
texts before processing).

### 2.1 (2 marks)

Calculate:

- two measures of the corpus size: the total number of words used in all descriptions, and the total number of TV shows

- the total number of distinct words in the descriptions (the vocabulary size)

- the average number of words in each description (i.e. the average document length)

- the average appearance count of each word (the hit ratio for search)

- the coefficients of Herdan's Law

Make a log-log plot to confirm that the data follows Herdan's Law. 
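Herdan's Law says that vocabulary size grows as a power of corpus size, $V = k N^{\beta}$, which becomes a straight line after taking logs of both sides. The sketch below shows one way to collect $(N, V)$ pairs as documents are added and to fit $k$ and $\beta$ with `numpy.polyfit`. The helper names (`vocabulary_growth`, `fit_herdan`) are hypothetical, and whitespace splitting stands in for `nltk.word_tokenize` so that the cell needs no extra downloads.

```python
import numpy as np

def vocabulary_growth(tokenised_docs):
    """Return cumulative token counts N and vocabulary sizes V after each document."""
    seen = set()
    total = 0
    n_values, v_values = [], []
    for tokens in tokenised_docs:
        total += len(tokens)
        seen.update(tokens)
        n_values.append(total)
        v_values.append(len(seen))
    return np.array(n_values), np.array(v_values)

def fit_herdan(n_values, v_values):
    """Fit V = k * N**beta by least squares on log V = log k + beta * log N."""
    beta, log_k = np.polyfit(np.log(n_values), np.log(v_values), 1)
    return float(np.exp(log_k)), float(beta)

# Toy documents, split on whitespace as a stand-in for nltk.word_tokenize.
toy_docs = [d.lower().split() for d in
            ["A spy drama", "A cooking show", "The spy returns"]]
n_values, v_values = vocabulary_growth(toy_docs)
print(n_values, v_values)

# Sanity check on synthetic data that follows the law exactly (k = 5, beta = 0.6).
synthetic_n = np.array([100, 1_000, 10_000, 100_000])
k, beta = fit_herdan(synthetic_n, 5.0 * synthetic_n ** 0.6)
print(k, beta)
```

Three documents are far too few for a meaningful fit; the toy corpus only shows the bookkeeping. On the real descriptions, plotting `n_values` against `v_values` on log-log axes should give points close to the fitted line if the law holds.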

### 2.2 (1 mark)

Calculate the current ratio of documents to distinct vocabulary items ($C/V$), and compare it
to the theoretical prediction from the formula:

$$
    \frac{C}{V} = \frac{ N^{1 - \beta}}{k L}
$$

Where

- $C$ is the number of *documents* in the corpus.
- $L$ is the average length of a document in the corpus.
- $V$ is the number of distinct vocabulary items.
- $N$ is the number of words in the corpus.
- $k$ and $\beta$ are the values you derived in the previous exercise.
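For reference, the formula follows from Herdan's Law together with the definition of average document length, using only the symbols defined above:

$$
    V = k N^{\beta}, \qquad C = \frac{N}{L}
    \quad\Longrightarrow\quad
    \frac{C}{V} = \frac{N / L}{k N^{\beta}} = \frac{N^{1 - \beta}}{k L}
$$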

### 2.3 (0.5 marks)

Based on your answer to 2.2, you can reasonably expect that one of the
best ways to improve our classifier will be to add more documents.

If current trends continue, TVmaze will have information for a million
shows in 2045.

What would you expect for the following:

- $C/V$

- The total vocabulary size (using Herdan's Law)


### 2.4 (0.5 marks)

You will use this answer to tune the classifier in Task 4.

We should exclude hapax legomena (words that appear only once in the corpus) from the vocabulary, since they cannot be useful to the classifier.

How many words of vocabulary remain?
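A minimal sketch of the counting step on two hypothetical descriptions, again using whitespace splitting as a stand-in for `nltk.word_tokenize` (so punctuation stays attached to words here, which the real tokeniser would separate).

```python
from collections import Counter

# Hypothetical descriptions; the real ones are in the descriptions array.
toy_descriptions = [
    "A detective solves a murder.",
    "A chef solves a kitchen crisis.",
]

# Lowercase, then split on whitespace (stand-in for nltk.word_tokenize).
tokens = [w for d in toy_descriptions for w in d.lower().split()]
counts = Counter(tokens)

# Hapax legomena occur exactly once in the whole corpus.
hapaxes = {w for w, c in counts.items() if c == 1}
remaining_vocabulary = {w for w, c in counts.items() if c > 1}

print(len(counts), len(hapaxes), len(remaining_vocabulary))
```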

### 2.5 (1 mark)

We have mentioned Chollet's heuristic in class:

> It turns out that when approaching a new text-classification task, you should pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample (see figure 11.11). If that ratio is small—less than 1,500—then the bag-of-bigrams model will perform better (and as a bonus, it will be much faster to train and to iterate on too). If that ratio is higher than 1,500, then you should go with a sequence model.

Calculate this ratio.
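The ratio in the quotation is the number of samples divided by the mean number of words per sample. As a sketch on three hypothetical descriptions (with whitespace splitting standing in for `nltk.word_tokenize`):

```python
# Hypothetical descriptions with 4, 5 and 6 words.
toy_descriptions = [
    "Two chefs compete weekly",
    "A spy saves the world",
    "A family moves to a farm",
]

n_samples = len(toy_descriptions)
mean_words = sum(len(d.split()) for d in toy_descriptions) / n_samples

# Number of samples divided by mean words per sample.
ratio = n_samples / mean_words

# Below 1,500 the quoted heuristic favours a bag-of-bigrams model.
prefers_bag_of_bigrams = ratio < 1500
print(ratio, prefers_bag_of_bigrams)
```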

## Task 3 (5 marks) - Information retrieval

In this task you are going to create a naive search engine that will let you find a "similar"
TV show.

Here is a randomly-selected show for you to use in this section. You will also use the data
in `show_df`.

In [None]:
selected_show = show_df[show_df.in_genre == 1.0].sample(n=1, random_state=random_state_key)
selected_show.T

In [None]:
selected_show.iloc[0].description

### 3.1 (2 marks)

Vectorise the `description` of each show using TFIDF. 

- Vectorise words and bigrams

- Only include words and bigrams that appear at least twice

- Only include words and bigrams that appear in less than 50% of the descriptions

### 3.2 (1 mark)

Write code that shows the size of this new vocabulary (the total number of words and bigrams).

### 3.3 (2 marks)

Iterate over the shows that don't have your genre to find the show whose description is most
similar (using cosine similarity) to the show that was chosen for you.

That is, you should end up with a show that:

- Has a very similar description to the show described at the start of Task 3

- Belongs to a different genre.
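The search can be sketched as follows on hypothetical toy data. Instead of an explicit loop, this version computes all similarities in one call to scikit-learn's `cosine_similarity` and then masks the in-genre shows; a loop over the out-of-genre rows would give the same answer. The variable names and the default `TfidfVectorizer` settings are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical shows; index 0 plays the role of the selected show.
toy_descriptions = np.array([
    "a spy chases a stolen diamond",         # selected show (in genre)
    "a detective chases a stolen painting",  # out of genre
    "a chef opens a restaurant",             # out of genre
    "a spy trains new recruits",             # in genre
])
toy_in_genre = np.array([1.0, 0.0, 0.0, 1.0])
selected_index = 0

tfidf = TfidfVectorizer().fit_transform(toy_descriptions)

# Similarity of the selected show to every show, flattened to one dimension.
similarities = cosine_similarity(tfidf[selected_index], tfidf).ravel()

# Mask in-genre shows (including the selected one) so they cannot win.
similarities[toy_in_genre == 1.0] = -np.inf

best_index = int(np.argmax(similarities))
print(best_index, toy_descriptions[best_index])
```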



## Task 4 (10 marks) - detect genres

In this task, we'll be building a naively simple model for identifying TV genres.

### 4.1 (1 mark)

Use an sklearn function to break your dataset into a training set, and a test set. 

Set the random number initializer to your 
`random_state_key` so that this notebook always returns the same results.
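A minimal sketch using `train_test_split` on hypothetical toy arrays. The 20% test size and the use of `stratify` (which keeps the in-genre proportion the same in both sets) are assumptions, not requirements of the task.

```python
import numpy as np
from sklearn.model_selection import train_test_split

random_state_key = 12345  # placeholder value, as in the cell near the top

# Hypothetical stand-ins for descriptions and in_genre.
toy_descriptions = np.array([f"show number {i}" for i in range(10)])
toy_in_genre = np.array([0.0, 1.0] * 5)

train_x, test_x, train_y, test_y = train_test_split(
    toy_descriptions,
    toy_in_genre,
    test_size=0.2,                  # assumed split; choose your own
    random_state=random_state_key,  # makes the split reproducible
    stratify=toy_in_genre,          # keep class balance in both sets
)
print(len(train_x), len(test_x))
```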

### 4.2 (1 mark)

Create a vectorizer for your data, and prepare it on the descriptions in the training data. 

Set `max_tokens` to the value in your answer from 2.4 (plus 1 for the "unknown" token).

(The vectorizer you used in section 3.1 was trained on all data, not just your training data,
so cannot be re-used here without leaking test information into the training data.)

It should use TFIDF weighting. 

### 4.3 (1 mark)

Use the vectorizer to transform the training and test data.

### 4.4 (2 marks)

We are creating a logistic regression model using Keras, which we will use
to predict the genre of a TV show based on its description.

Create a model based on the following:

- An input layer with a shape based on the size of the vocabulary from your vectorization.

- An output layer that uses a sigmoid activation function.

Compile your model (choose an appropriate loss, and add 'accuracy' as a metric) and display a summary of it.

### 4.5 (1 mark)

Fit the model to the training data. The target variable is `in_genre`.
Hold out 10% of the data as validation data. Stop when the loss in the 
validation data stops improving.
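The model in 4.4 and the fitting procedure in 4.5 can be sketched together as below. Everything about the data is hypothetical: random numbers stand in for the TFIDF vectors, and `vocab_size = 20` stands in for your real vocabulary size. The optimizer, the `patience` value and the epoch limit are assumptions too. The import falls back to standalone Keras if TensorFlow's bundled copy is not available.

```python
import numpy as np

try:
    from tensorflow import keras
except ImportError:  # fall back to standalone Keras
    import keras

# Hypothetical data: random "TFIDF" rows and a label driven by column 0.
rng = np.random.default_rng(0)
vocab_size = 20
x_train = rng.random((200, vocab_size)).astype("float32")
y_train = (x_train[:, 0] > 0.5).astype("float32")

# Logistic regression: one dense unit with a sigmoid activation.
model = keras.Sequential([
    keras.Input(shape=(vocab_size,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Stop once the validation loss has not improved for 3 epochs (assumed patience).
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
history = model.fit(
    x_train,
    y_train,
    validation_split=0.1,  # hold out 10% of the training data
    epochs=50,             # upper bound; early stopping may end sooner
    callbacks=[early_stopping],
    verbose=0,
)
print(sorted(history.history.keys()))
```

Binary cross-entropy is the usual loss for a single sigmoid output with 0/1 targets. The `history.history` dictionary holds the per-epoch training and validation curves.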

### 4.6 (1 mark)

Plot the training and validation loss and accuracy and confirm whether your model has begun to overfit.

### 4.7 (1 mark)

Report the accuracy of your model on the test data. It should be quite close to the
validation accuracy.

### 4.8 (1 mark)

Extract the weights from the logistic regression layer, and match them up with the words in the
vocabulary.

Identify any vocabulary that is strongly associated with being in-genre or with being out-of-genre.
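Once you have the weights, the matching step is a sort. The sketch below uses hypothetical stand-ins: in the notebook, the vocabulary would come from your vectorizer, and the kernel would be the first array returned by the dense layer's `get_weights()` (the second is the bias). It assumes the vocabulary order matches the input columns of the model.

```python
import numpy as np

# Hypothetical stand-ins for the vectorizer vocabulary and the layer kernel
# (kernel shape: vocabulary size x 1).
vocabulary = np.array(["[UNK]", "murder", "detective", "recipe", "wedding"])
kernel = np.array([[0.0], [1.8], [1.2], [-1.5], [-0.4]])

weights = kernel.ravel()
order = np.argsort(weights)  # indices from most negative to most positive

top_n = 2
lowest = order[:top_n]          # strongest out-of-genre indicators
highest = order[::-1][:top_n]   # strongest in-genre indicators

most_out_of_genre = [(str(vocabulary[i]), float(weights[i])) for i in lowest]
most_in_genre = [(str(vocabulary[i]), float(weights[i])) for i in highest]

print(most_in_genre)
print(most_out_of_genre)
```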

### 4.9 (1 mark)

Create a description of a new show to pitch to someone at Macquarie's Film and Television School that is going to be strongly associated with your genre. If you are lacking inspiration, this is the kind of task that 
large language models do quite well.

Confirm that your model does correctly predict the genre.

## Task 5 (2 marks) - embeddings

### 5.1 (2 marks)

Create a new model with an embedding layer, compile it, fit it and evaluate its performance on the training data
set.

Don't worry if it doesn't improve the model performance --- based on 
your answers to 2.3 we would expect an embedding layer to make it much worse, and based on 2.5 we would
expect a sequence-to-sequence model to perform poorly as well!

# Submission

Your submission should consist of this Jupyter notebook with all your code and explanations inserted into the notebook as text cells. **The notebook should contain the output of the runs. All code should run. Code with syntax errors or code without output will not be assessed.**

**Do not submit multiple files. If you feel you need to submit multiple files, please contact greg.baker@mq.edu.au first.**

Examine the text cells of this notebook so that you can have an idea of how to format text for good visual impact. You can also read this useful [guide to the MarkDown notation](https://daringfireball.net/projects/markdown/syntax), which explains the format of the text cells.

Each task specifies a number of marks. The final mark of the assignment is the sum of all the marks of each individual task.

By submitting this assignment you are acknowledging that this is your own work. Any submissions that break the code of academic honesty will be penalised as per [the academic integrity policy](https://policies.mq.edu.au/document/view.php?id=3).

## A note on the use of AI code generators

We view AI code generators such as Copilot, CodeGPT, etc. as tools that can help you write code quickly. You are allowed to use these tools. If you choose to use them, make the following explicit:
- What part of your code is based on the output of such tools, 
- What tools you used,
- What prompts you used to generate the code, and
- What modifications you made on the generated code.

This will help us assess your work fairly.