# Project: Movie Classification
## Introduction
Welcome to the final project of Foundations of Data Science! In this Capstone Project, you will develop a classifier that can determine whether a movie is a comedy or a thriller based solely on the frequency of specific words in the movie's screenplay. This project will not only enhance your understanding of machine learning techniques but also reinforce the data science concepts you've learned throughout the course.

## Objectives
By the end of this project, you will be able to:

1. Build and understand the workings of a k-nearest-neighbors (KNN) classifier.
2. Test and evaluate the performance of your classifier on a dataset.
3. Create a presentation that communicates your journey through the data science life cycle to your final classifier.

This project also offers a chance to review other essential data exploration and inference tools that you have learned in this course.

## Logistics
### Deadline
- **Latest Submission Date:** The project must be completed by 11:59 pm on Tuesday, May 14th. Student presentations of their classifiers will take place on Wednesday, May 15th, during class.


### Support
- **Class Time:** Take advantage of class time for discussions with classmates or to seek clarification on project requirements.
- **Personal Support:** If you're feeling overwhelmed or stuck, reach out for a one-on-one discussion. Our goal is to make this learning experience both challenging and exciting.

### Testing
- **Preliminary Tests:** Passing these tests does not guarantee correctness; they typically check only the structure of your data.

- **Final Evaluation:** Additional tests will verify the accuracy of your submissions for grading. Review and verify your work carefully, then prepare a presentation for the class that communicates the insights you discovered at each step of the data science life cycle on the way to your final classifier. You may assume a persona for your audience and tailor your presentation to it. For example, you could imagine that your audience is the executive board of Netflix and that you would like them to adopt your movie classifier. Be creative: the goal is to think about how you communicate your results, and the likely value and implications of your analysis, to a target audience. (*Hint: You should probably think about the target audience before you begin your analysis. Extra credit for creativity.*)

### Advice
- **Incremental Development:** Break down complex tasks into manageable steps. Use separate lines for each step, name each result distinctly, and verify outcomes at each stage.


### Getting Started
The workbook already includes the necessary libraries. Make sure to load them:

- `datascience`
- `numpy`
- `matplotlib.pyplot` (imported as `plots`)


We have also provided some helper functions:

- `plot_with_two_features`
- `fast_distances`

Prepare your development environment by copying this notebook and naming it according to the convention we used during the term. You must upload it to your shared Google Drive before the deadline.


In [None]:
# Run this cell to set up the notebook, but please don't change it.
import numpy as np
import math
from datascience import *

# These lines set up the plotting functionality and formatting.
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)



# 1. The Dataset

In this project, we are exploring movie screenplays. We'll be trying to predict each movie's genre from the text of its screenplay. In particular, we have compiled a list of 5,000 words that occur in conversations between movie characters. For each movie, our dataset tells us the frequency with which each of these words occurs in certain conversations in its screenplay. All words have been converted to lowercase.

Run the cell below to read the `movies` table. **It may take up to a minute to load.**

In [None]:
movies = Table.read_table('movies.csv')
movies.where("Title", "wild wild west").select(0, 1, 2, 3, 4, 14, 49, 1042, 4004)

The above cell prints a few columns of the row for the comedy movie *Wild Wild West*. The movie contains 3446 words. The word "it" appears 74 times, so it makes up $\frac{74}{3446} \approx 0.021474$ of the words in the movie. The word "england" doesn't appear at all.
This numerical representation of a body of text, one that describes only the frequencies of individual words, is called a bag-of-words representation. A lot of information is discarded in this representation: the order of the words, the context of each word, who said what, the cast of characters and actors, etc. However, a bag-of-words representation is often used for machine learning applications as a reasonable starting point, because a great deal of information is also retained and expressed in a convenient and compact format. In this project, we will investigate whether this representation is sufficient to build an accurate genre classifier.
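
To make the bag-of-words idea concrete, here is a minimal sketch that turns a short piece of text into word proportions. The function name `word_proportions` and the sample line are assumptions made up for illustration; they are not part of the pipeline that produced `movies.csv`, which also stems words and restricts them to a fixed vocabulary.

```python
from collections import Counter

def word_proportions(text):
    """Return a dict mapping each lowercase word to its share of all words in text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical line of dialogue, used only for illustration.
sample = "It is what it is and it works"
proportions = word_proportions(sample)

# Word order is gone; only each word's share of the text remains.
for word, share in sorted(proportions.items()):
    print(word, round(share, 3))
```

Notice that two texts with the same words in a different order would produce exactly the same dictionary, which is the information loss described above.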

All movie titles are unique. The `row_for_title` function provides fast access to the one row for each title.

*Note: All movies in our dataset have their titles lower-cased.*

In [None]:
title_index = movies.index_by('Title')
def row_for_title(title):
    """Return the row for a title, similar to the following expression (but faster)

    movies.where('Title', title).row(0)
    """
    return title_index.get(title)[0]

row_for_title('the terminator')

For example, the fastest way to find the frequency of "none" in the movie *The Terminator* is to access the `'none'` item from its row. Check the original table to see if this worked for you!

In [None]:
row_for_title('the terminator').item('none')

The dataset was extracted from [a dataset from Cornell University](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of 5000 common words in each movie. Dropping the five columns "Title", "Year", "Rating", "Genre", and "# Words" leaves only the word columns, so we can count how many words are tracked for each movie, along with the number of movies.
Run the code below.

In [None]:
print('Words with frequencies:', movies.drop(np.arange(5)).num_columns)
print('Movies with genres:', movies.num_rows)

#### Question 1.0
Set `expected_row_sum` to the number that you __expect__ will result from summing all proportions in each row, excluding the first five columns.

<!--
BEGIN QUESTION
name: q1_0
-->

In [None]:
# Set expected_row_sum to a number that's the (approximate) sum of each row of word proportions.
expected_row_sum = ...

## 1.1. Word Stemming
The columns other than "Title", "Year", "Rating", "Genre", and "# Words" in the `movies` table are all words that appear in some of the movies in our dataset.  These words have been *stemmed*, or abbreviated heuristically, in an attempt to make different [inflected](https://en.wikipedia.org/wiki/Inflection) forms of the same base word into the same string.  For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.
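
The handout does not say which stemmer was used to build the dataset, so the toy function below is only a sketch of the general idea: strip a suffix so that related forms collapse to one string. The name `toy_stem` and its suffix list are hypothetical and far simpler than a real stemmer.

```python
# Hypothetical suffix list for illustration; real stemmers apply many more rules.
SUFFIXES = ("erial", "ers", "er", "ed", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The four inflected forms mentioned above collapse to the same stem.
for word in ["manage", "manager", "managed", "managerial"]:
    print(word, "->", toy_stem(word))
```

Because several words share one stem, the proportion stored in a stem's column is the combined proportion of all of those words.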

Stemming makes it a little tricky to search for the words you want to use, so we have provided another table that will let you see examples of unstemmed versions of each stemmed word.  Run the code below to load it.

In [None]:
# Just run this cell.
vocab_mapping = Table.read_table('stem.csv')
stemmed = np.take(movies.labels, np.arange(3, len(movies.labels)))
vocab_table = Table().with_column('Stem', stemmed).join('Stem', vocab_mapping)
vocab_table.take(np.arange(1100, 1110))

#### Question 1.1.1
Assign `stemmed_message` to the stemmed version of the word "vegetables".

<!--
BEGIN QUESTION
name: q1_1_1
-->

In [None]:
stemmed_message = ...
stemmed_message

#### Question 1.1.2
What stem in the dataset has the most words that are shortened to it? Assign `most_stem` to that stem.

<!--
BEGIN QUESTION
name: q1_1_2
-->

In [None]:
most_stem = ...
most_stem

#### Question 1.1.3
What is the longest word in the dataset whose stem wasn't shortened? Assign that to `longest_uncut`. Break ties alphabetically from Z to A (so if your options are "albatross" or "batman", you should pick "batman").

<!--
BEGIN QUESTION
name: q1_1_3
-->

In [None]:
# In our solution, we found it useful to first add columns with
# the length of the word and the length of the stem,
# and then to add a column with the difference between those lengths.
# What will the difference be if the word is not shortened?

tbl_with_lens = ...
tbl_with_dif = ...


longest_uncut = ...
longest_uncut

## 2. The Data Science Life Cycle


The Data Science Life Cycle is a framework comprising interconnected steps designed to systematically derive insights from data. Each step builds upon the previous one, ensuring a comprehensive approach to solving data-driven problems. In this project, the focus is on building a classifier to predict movie genres based on screenplay word frequencies. Below are the detailed steps of the Data Science Life Cycle as applied to this project:

### Understanding the Problem
The first step in the Data Science Life Cycle is to understand the problem at hand. For this project, our goal is to develop a classifier that can predict the genre of a movie—specifically, distinguishing between comedies and thrillers—based on the frequency of certain words in the movie’s screenplay. This problem statement guides the subsequent steps in our analysis and modeling.

### Data Collection and Preparation
We begin by examining the "movies" table, which includes frequencies of 5,000 stemmed words across various movies, along with metadata such as the movie title, year, rating, genre, and total word count. Ensuring that our data is clean and appropriately formatted is crucial for effective analysis.

### Data Exploration & Inferential Analysis (Hypothesis formulation & Testing)
This stage involves getting familiar with the data through various exploratory analysis techniques. Look at the distribution of genres, the range of word counts, how genres might correlate with certain words, and other patterns that could inform the development of your classifier. Visualization and summary statistics are key tools for exploration.

The patterns you observe in your data may lead you to formulate and test some interesting hypotheses. For example, you may suspect that comedies typically have higher ratings than thrillers. In this case, you would form a null hypothesis and an alternative hypothesis, then test them against the data.

### Model Building and Evaluation
Using the insights gained from the exploration stage, you'll build a k-nearest-neighbors (KNN) classifier. This involves selecting features, choosing a distance metric, and deciding on the number of neighbors. You will then train your model on a subset of the data and test its performance on unseen data to evaluate its effectiveness.

### Communication of Results
The final step involves synthesizing your findings and the methodology into a coherent presentation aimed at stakeholders interested in your project. This presentation should cover the problem statement, your analytical approach, key findings, model performance, and potential implications of your results.


## 2.1 Exploring Your Data

Begin by exploring your dataset to better understand its characteristics and prepare for the analytical tasks ahead. Given the goal of developing a classifier to predict a movie's genre from its screenplay, it's important to examine the data from multiple perspectives. Remember that visualization is key here - employ visual tools to help uncover patterns in the data. Consider using histograms for ratings distributions by genre, scatter plots to visualize relationships between different variables, etc.

Here are some steps and considerations to guide your exploration:



- 2.1.1 Understand the Dataset Structure: Familiarize yourself with the structure of the "movies" table. Identify the total number of entries, the types of data columns available (e.g., numeric for word frequencies, categorical for genres), and any missing values or anomalies in the data.

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




- 2.1.2 Statistical Summaries: Generate summary statistics for the numeric columns such as '# Words' and the frequencies of some key stemmed words. This will give you insights into the central tendency and variability of your data, which are crucial for later stages of the project.


In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.



- 2.1.3 Genre Distribution: Analyze the distribution of movie genres within your dataset. Understanding how genres are represented can help in assessing whether the data is balanced or if there are biases that might affect model training.

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




- Document Insights So Far: As you explore, document your findings and any insights that could influence your approach to building the classifier. This documentation will be useful for the subsequent phases of the project.




### Exploration Insights So Far:

... \
... \
...

The remainder of this notebook presents specific questions and notes essential to guiding your analysis and fulfilling the core requirements of the project. Beyond these, however, you’re encouraged to be creative in your analysis approach. Dive into trends, formulate your own additional hypotheses, and perhaps uncover unexpected insights.




## 2.2 Inferential Analysis:

Assuming a 5% p-value cut-off, we are going to test different claims, such as whether the average ratings of comedies differ from the average ratings of thrillers, or whether the proportions of comedies and thrillers differ between decades.
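
One common way to test a claim like these is a permutation test: shuffle the group labels many times and see how often the shuffled difference is at least as large as the observed one. The sketch below uses made-up ratings rather than the project data, and the name `permutation_p_value` and its default of 2000 repetitions are illustrative assumptions, not requirements of the project.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, repetitions=2000, seed=0):
    """Two-sided p-value for the difference in group means under random relabeling."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(repetitions):
        # Shuffling the pooled values simulates the null hypothesis:
        # group membership has no effect on the values.
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / repetitions

# Made-up ratings for illustration only; these are not values from movies.csv.
ratings_a = [6.1, 7.0, 6.4, 5.9, 6.8, 7.2, 6.5, 6.0]
ratings_b = [6.3, 6.9, 6.2, 6.1, 6.7, 7.1, 6.6, 5.8]
p_value = permutation_p_value(ratings_a, ratings_b)
print("p-value:", p_value)
print("Reject the null at the 5% cut-off:", p_value < 0.05)
```

The same structure works for a difference in proportions if each value is coded as 0 or 1, since the mean of such values is a proportion.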



- 2.2.1 Do longer movies receive higher ratings on average than shorter movies? A longer movie is one in which the word count in the movie summary exceeds the average word count recorded in the dataset.

Hypothesis:

  - **Null Hypothesis (Ho):** ...

  - **Alternative Hypothesis (Ha):** ...

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




- 2.2.2 Now let's investigate the claim that there is a significant difference in the proportion of `comedy` and `thriller` movie genres released before and after the year `2000`.

Hypothesis:
  - **Null Hypothesis (Ho):** ...

  - **Alternative Hypothesis (Ha):** ...

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.






## 2.3 Exploring Relationships Between Word Proportions

Let's look at the relationships between the proportions of different words.

The first association we'll investigate is the association between the proportion of words that are "outer" and the proportion of words that are "space".

As usual, we'll investigate our data visually before performing any numerical analysis.

Run the cell below to plot a scatter diagram of space proportions vs outer proportions and to create the `outer_space` table.

In [None]:
# Just run this cell!
outer_space = movies.select("outer", "space")
outer_space.scatter("outer", "space")
plots.axis([-0.001, 0.0025, -0.001, 0.005]);
plots.xticks(rotation=45);

#### Question 2.3.1
Looking at that chart, it is difficult to see whether there is an association. Determine whether there is a true (non-random) association between the proportion of words that are "outer" and the proportion of words that are "space" across the movies in the dataset.

<!--
BEGIN QUESTION
name: q1_2_1
-->

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.








<!-- BEGIN QUESTION -->

#### Question 2.3.2
Choose two *different* words in the dataset, other than *outer* and *space*, with a correlation higher than 0.2 or lower than -0.2, and plot a scatter plot with a line of best fit for them. The code to plot the scatter plot and line of best fit is given for you; you just need to calculate the correct values for `r`, `slope`, and `intercept`.

*Hint: It's easier to think of words with a positive correlation, i.e. words that are often mentioned together*.

*Hint 2: Try to think of common phrases or idioms*.
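
As a reminder of how these three quantities relate, here is a minimal sketch on made-up numbers. It assumes the usual definitions: `r` is the mean of the products of the two variables in standard units, the slope is `r` times the ratio of the SDs, and the line passes through the point of averages. The helper names are hypothetical, and the values are not from `movies.csv`.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Population standard deviation."""
    center = mean(values)
    return math.sqrt(mean([(v - center) ** 2 for v in values]))

def standard_units(values):
    center, spread = mean(values), sd(values)
    return [(v - center) / spread for v in values]

def correlation(xs, ys):
    """r: the mean of the products of xs and ys in standard units."""
    return mean([x * y for x, y in zip(standard_units(xs), standard_units(ys))])

def fit_line(xs, ys):
    """Return (slope, intercept) of the regression line for predicting ys from xs."""
    slope = correlation(xs, ys) * sd(ys) / sd(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

# Made-up word proportions for five hypothetical movies.
xs = [0.000, 0.001, 0.002, 0.003, 0.004]
ys = [0.001, 0.002, 0.002, 0.004, 0.005]
r = correlation(xs, ys)
slope, intercept = fit_line(xs, ys)
print("r:", r, "slope:", slope, "intercept:", intercept)
```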


<!--
BEGIN QUESTION
name: q1_2_2
manual: true
image: true
-->

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.
word_x = ...
word_y = ...








r = ...
slope = ...
intercept = ...

# DON'T CHANGE THESE LINES OF CODE
movies.scatter(word_x, word_y)
max_x = max(movies.column(word_x))
plots.title(f"Correlation: {r}, magnitude greater than .2: {abs(r) >= 0.2}")
plots.plot([0, max_x * 1.3], [intercept, intercept + slope * (max_x*1.3)], color='gold');

- Can you confirm that this relationship is not due to random chance, given a p-value cut-off of 0.05?

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.








Hopefully the exercise above gives you a hint of how you might determine candidate features for your classifier.

<!-- END QUESTION -->



## 3. Building a classifier
Now it is time to use the `movies` dataset for the purposes of:

1.  *training* movie genre classifiers.
2.  *testing* the performance of our classifiers.

Hence, we need two different datasets: *training* and *test*.

The purpose of a classifier is to classify unseen data that is similar to the training data. Therefore, it is important to ensure that there are no movies that appear in both sets. The dataset has already been permuted randomly, so it's easy to split.  
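
The idea of splitting already-shuffled data can be sketched with a plain Python list standing in for the rows of a table. The function name `split_rows` and the 20 placeholder rows are assumptions for illustration; adapting the idea to a `datascience` table is left to you.

```python
def split_rows(rows, train_fraction):
    """Split an already-shuffled list into (train, test) with no shared items."""
    cutoff = round(len(rows) * train_fraction)
    return rows[:cutoff], rows[cutoff:]

# Stand-in for a shuffled table: 20 hypothetical row identifiers.
rows = list(range(20))
train, test = split_rows(rows, 0.85)

print("Training:", len(train), "; Test:", len(test))
print("Rows in both sets:", set(train) & set(test))
```

Because the two slices meet at a single cutoff, every row lands in exactly one of the two sets.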



### 3.1 Initial Set-up

3.1.1 Create a training set, `train_movies`, containing 85% of `movies`, and a test set, `test_movies`, containing the remaining 15%. Then use a bar chart to display the proportion of comedies and thrillers in each set.

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.



train_movies = ...
test_movies = ...

print("Training: ",   train_movies.num_rows, ";",
      "Test: ",       test_movies.num_rows)

### 3.2 Initial Classifier Development
3.2.1. Start by building a few (at least two) different simple classifiers using only two or three features and k=1. This will help establish a baseline for your classifier's performance.


*Hint: It is advisable to develop several helper functions, especially one that encapsulates the classification process. This improves the reusability and readability of your code and aids systematic experimentation and testing. The work you did earlier in the project should also let you justify the features you select for your classifier(s).*
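
To illustrate what such a helper might look like, here is a minimal sketch of k-nearest-neighbors on plain Python tuples. The names `distance` and `classify` and the four training points are made up for illustration; your own version will work with tables and the features you choose.

```python
import math
from collections import Counter

def distance(point_a, point_b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))

def classify(test_point, training_points, training_labels, k):
    """Return the majority label among the k training points nearest to test_point."""
    by_distance = sorted(
        zip(training_points, training_labels),
        key=lambda pair: distance(test_point, pair[0]),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up two-feature examples; not taken from movies.csv.
training_points = [(0.001, 0.002), (0.002, 0.001), (0.008, 0.009), (0.009, 0.007)]
training_labels = ["comedy", "comedy", "thriller", "thriller"]

print(classify((0.0015, 0.0015), training_points, training_labels, k=1))
print(classify((0.0085, 0.0080), training_points, training_labels, k=3))
```

Keeping the distance calculation and the vote in separate functions makes it easy to swap in a different feature set or value of `k` later.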

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




The function below creates a plot to display two features (e.g., "water" and "feel") of a test movie and some training movies. As you can see in the result, *Monty Python and the Holy Grail* is more similar to *Clerks.* than to *The Avengers* based on these features, which makes sense because the first two are both comedies, while *The Avengers* is a thriller.

Feel free to adapt this for your use as necessary!

In [None]:
# Just run this cell.
def plot_with_two_features(test_movie, training_movies, x_feature, y_feature):
    """Plot a test movie and training movies using two features."""
    test_row = row_for_title(test_movie)
    distances = Table().with_columns(
            x_feature, [test_row.item(x_feature)],
            y_feature, [test_row.item(y_feature)],
            'Color',   ['unknown'],
            'Title',   [test_movie]
        )
    for movie in training_movies:
        row = row_for_title(movie)
        distances.append([row.item(x_feature), row.item(y_feature), row.item('Genre'), movie])
    distances.scatter(x_feature, y_feature, group='Color', labels='Title', s=30)

training = ["clerks.", "the avengers"]
plot_with_two_features("monty python and the holy grail", training, "water", "feel")
plots.axis([-0.001, 0.0011, -0.004, 0.008]);

### 3.3 Classifier Evaluation
3.3.1 Classify "Monty Python and the Holy Grail" and "The Avengers" using the five nearest neighbors in the different simple classifiers you created above. Analyze the similarities and differences of their nearest neighbors.


In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.



3.3.2 If you modified one of your models to only use the features "water" and "feel", what are the names and genres of the 5 movies in the training set closest to Monty Python and the Holy Grail?

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




### 3.4 Optimizing Features
Using too few features, or all of them, in a classifier has downsides. One clear downside of using many features is computational: computing Euclidean distances takes a long time when we have lots of features.

So we're going to select just a few: no more than 20. We'd like to choose features that are very discriminative, that is, features that lead us to correctly classify as much of the test set as possible. This process of choosing features that will make a classifier work well is sometimes called feature selection, or, more broadly, feature engineering. Some things to consider as you select your features:

1. words common in both comedy and thriller movies
2. words uncommon in comedy movies and common in thriller movies
3. words common in comedy movies and uncommon in thriller movies
4. words uncommon in both comedy and thriller movies
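
One possible way to act on the four cases above is to compare each word's average frequency in the two genres and rank words by the size of the gap. This is only a heuristic sketch on made-up numbers, not the required method; the function name `rank_by_genre_gap` and the three example words are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def rank_by_genre_gap(frequencies, genres):
    """Rank words by |mean frequency in comedies - mean frequency in thrillers|."""
    gaps = {}
    for word, values in frequencies.items():
        comedy = [v for v, g in zip(values, genres) if g == "comedy"]
        thriller = [v for v, g in zip(values, genres) if g == "thriller"]
        gaps[word] = abs(mean(comedy) - mean(thriller))
    # Largest gap first: these words separate the genres most on average.
    return sorted(gaps, key=gaps.get, reverse=True)

# Made-up frequencies for four hypothetical movies; not taken from movies.csv.
genres = ["comedy", "comedy", "thriller", "thriller"]
frequencies = {
    "the":   [0.040, 0.042, 0.041, 0.039],  # common in both genres
    "laugh": [0.004, 0.005, 0.001, 0.000],  # more common in the comedies
    "kill":  [0.000, 0.001, 0.006, 0.005],  # more common in the thrillers
}
print(rank_by_genre_gap(frequencies, genres))
```

A word that is common in both genres, like "the" in this toy data, ends up at the bottom of the ranking because it does little to tell the genres apart.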



To make calculating distances computationally faster, we have provided a function, `fast_distances`, to do this for you.  Read its documentation to make sure you understand what it does.  (Feel free to adapt and employ this function.)

In [None]:
# Just run this cell to define fast_distances.

def fast_distances(test_row, train_table):
    """Return an array of the distances between test_row and each row in train_table.

    Takes 2 arguments:
      test_row: A row of a table containing features of one
        test movie (e.g., test_my_features.row(0)).
      train_table: A table of features (for example, the whole
        table train_my_features)."""
    assert train_table.num_columns < 50, "Make sure you're not using all the features of the movies table."
    counts_matrix = np.asmatrix(train_table.columns).transpose()
    diff = np.tile(np.array(list(test_row)), [counts_matrix.shape[0], 1]) - counts_matrix
    np.random.seed(0) # For tie breaking purposes
    distances = np.squeeze(np.asarray(np.sqrt(np.square(diff).sum(1))))
    eps = np.random.uniform(size=distances.shape)*1e-10 #Noise for tie break
    distances = distances + eps
    return distances

3.4.1 Engage in feature engineering and selection to refine your classifier.
- Build three models of varying complexity:
  - Small model: Use a minimal set of features (fewer than 5).
  - Medium model: Use a moderate number of features for a balance between performance and complexity (10 features).
  - Large model: Use a comprehensive set of features for maximum discriminative power (20 features).

Analyze these models and determine the best of the three.

In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




### 3.5 Model Tuning
3.5.1. Further partition your training data to validate the optimal value of `k` for each model size (small, medium, large). This step involves creating a validation set from your existing training data to fine-tune the parameter `k` without using your test set.
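
The tuning loop can be sketched as follows: hold out part of the training data as a validation set, measure accuracy on it for several values of `k`, and keep the value that scores best. The data below are made up (one feature, with one deliberately noisy training point), and the names `classify` and `accuracy` are hypothetical helpers, not functions provided by the project.

```python
import math
from collections import Counter

def classify(test_point, points, labels, k):
    """Majority label among the k points nearest to test_point (Euclidean distance)."""
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(test_point, p)))
    nearest = sorted(zip(points, labels), key=lambda pair: dist(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(k, fit_points, fit_labels, check_points, check_labels):
    """Proportion of check points whose predicted label matches the true label."""
    correct = sum(
        classify(point, fit_points, fit_labels, k) == label
        for point, label in zip(check_points, check_labels)
    )
    return correct / len(check_points)

# Made-up one-feature data. The point (0.0024,) is a deliberately noisy example.
fit_points = [(0.001,), (0.002,), (0.003,), (0.0024,), (0.007,), (0.008,), (0.009,)]
fit_labels = ["comedy", "comedy", "comedy", "thriller", "thriller", "thriller", "thriller"]
validation_points = [(0.0015,), (0.0025,), (0.0075,), (0.0085,)]
validation_labels = ["comedy", "comedy", "thriller", "thriller"]

# Odd values of k avoid tied votes between two classes.
scores = {
    k: accuracy(k, fit_points, fit_labels, validation_points, validation_labels)
    for k in (1, 3, 5)
}
best_k = max(scores, key=scores.get)
print(scores)
print("Best k on the validation set:", best_k)
```

In this toy data, a single noisy neighbor misleads the classifier when `k` is 1, while a larger `k` outvotes it. The test set plays no part in the choice, so it remains a fair measure of final performance.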




In [None]:
### Begin your work here. Add as many code or text cells as you deem necessary.




## 4. Final Steps
After building and comparing various models and tuning the k parameter across different complexities, you are well-equipped to craft a compelling story. Prepare a final presentation that outlines your journey through the data science cycle, the decisions you made, the models you compared, and the insights you gained (e.g., patterns in the mistakes the different classifiers made, the improvement from your first classifier to subsequent ones, etc.).

This presentation should not only showcase your technical accomplishments but also your ability to convey complex information in an accessible and engaging manner.


## 5. Other Classification Methods (OPTIONAL)

**Note**: Everything below is **OPTIONAL**. Please only work on this part after you have finished and submitted the project. If you create new cells below, do NOT reassign variables defined in previous parts of the project.

Now that you've finished your k-NN classifier, you might be wondering what else you could do to improve your accuracy on the test set. Classification is one of many machine learning tasks, and there are plenty of other classification algorithms! If you feel so inclined, we encourage you to try any methods you feel might help improve your classifier.

We've compiled a list of blog posts with some more information about classification and machine learning. Create as many cells as you'd like below--you can use them to import new modules or implement new algorithms.

Blog posts:

* [Classification algorithms/methods](https://medium.com/@sifium/machine-learning-types-of-classification-9497bd4f2e14)
* [Train/test split and cross-validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)
* [More information about k-nearest neighbors](https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7)
* [Overfitting](https://elitedatascience.com/overfitting-in-machine-learning)

In future data science classes, such as Data Science 100, you'll learn about some of the algorithms in the blog posts above, including logistic regression. You'll also learn more about overfitting, cross-validation, and approaches to different kinds of machine learning problems.

There's a lot to think about, so we encourage you to find more information on your own!

Modules to think about using:

* [Scikit-learn tutorial](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
* [TensorFlow information](https://www.tensorflow.org/tutorials/)

...and many more!