# What We're Predicting

One issue we encountered with the datasets is that the IMDb and TMDb genres do not always match up. We discussed several options for dealing with this. First, though, we agreed that genres can match across the two data sources in two ways:
1. Exact matches - IMDb and TMDb have the same exact name for a type of movie.
2. Synonymous genres - IMDb lists “Sci-Fi”, TMDb lists “Science Fiction”, and these clearly mean the same thing.
Therefore, the first step is to convert all genre names to one common vocabulary.
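
A minimal sketch of this normalization step follows. The synonym table and the function names `normalize_genre` and `normalize_genres` are hypothetical, and choosing “Sci-Fi” as the canonical spelling is an assumption; the real table would need to cover every synonymous pair found in the two sources.

```python
# Hypothetical synonym table: maps source-specific spellings (lowercased)
# to one canonical genre name. Only the pair mentioned in the text is listed.
GENRE_SYNONYMS = {
    "science fiction": "Sci-Fi",
    "sci-fi": "Sci-Fi",
}


def normalize_genre(name):
    """Return the canonical genre name; unknown names are kept, whitespace-trimmed."""
    cleaned = name.strip()
    return GENRE_SYNONYMS.get(cleaned.lower(), cleaned)


def normalize_genres(names):
    """Normalize an iterable of genre names into a set of canonical names."""
    return {normalize_genre(n) for n in names}
```

After this step, a TMDb list such as `["Science Fiction", "Comedy"]` and an IMDb list such as `["Sci-Fi", "Comedy"]` normalize to the same set.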

Then, we came up with the following options:
1. Taking the union of the genre lists for each movie. For example, if IMDb lists only Comedy, and TMDb lists only Family, our result would be Comedy and Family.
2. Taking the intersection of the genre lists for each movie. In other words, for a given movie, only include genres that are listed on both websites.
3. Counting how many sources list each genre. The idea is that a genre listed by both sources should be of stronger predictive value than a genre listed by only one, but we can still keep genres that are listed by only one source.
4. Using genres from just one data source, and completely ignoring the genre data from the other data source.
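
The four options above can be compared side by side on a single movie. This is only an illustrative sketch: the function name `combine_genres` and the dictionary keys are hypothetical, the inputs are assumed to be already normalized to common genre names, and IMDb is arbitrarily used as the single source for option 4.

```python
from collections import Counter


def combine_genres(imdb, tmdb):
    """Return the result of each candidate option for one movie."""
    imdb, tmdb = set(imdb), set(tmdb)
    # Each source contributes at most 1 per genre, so counts are 1 or 2.
    counts = Counter(imdb) + Counter(tmdb)
    return {
        "union": imdb | tmdb,            # option 1
        "intersection": imdb & tmdb,     # option 2
        "counts": dict(counts),          # option 3
        "single_source": imdb,           # option 4 (IMDb chosen arbitrarily)
    }


result = combine_genres(["Comedy", "Drama"], ["Drama", "Family"])
```

Here `result["intersection"]` keeps only Drama, while `result["counts"]` records that Drama was listed by both sources and Comedy and Family by one each.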

After thinking this through, we believe that option 2 is the best one here. The two data sources appear to have fairly consistent genre classifications in the first place, and by taking the intersection of genres across them, we can be more confident in our predictions. In the small number of cases where there is no common genre across the two data sources, we will include the genres from both databases (an alternative would be to drop such movies from our dataset).
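
The chosen rule, intersection with a fallback to the union when the two sources share no genre, could be sketched as follows. The function name `target_genres` is hypothetical, and the inputs are assumed to be already normalized to common genre names.

```python
def target_genres(imdb, tmdb):
    """Genres listed by both sources; if there are none, genres from either source."""
    imdb, tmdb = set(imdb), set(tmdb)
    common = imdb & tmdb
    # Fallback for movies with no genre in common: keep everything from both.
    return common if common else imdb | tmdb
```

Under the alternative mentioned above, the fallback branch would instead signal that the movie should be dropped, for example by returning an empty set.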

To be clear, what we will be predicting is whether a given genre is listed for a movie on both IMDb and TMDb. We are therefore predicting a boolean for each genre for each movie.
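
To make the shape of the prediction target concrete, here is a small sketch that turns per-movie genre sets into one boolean per genre. The function name `genre_indicators`, the movie titles, and the genre list are all made up for illustration.

```python
def genre_indicators(movie_genres, all_genres):
    """Map each movie to a list of booleans aligned with `all_genres`."""
    return {
        title: [genre in genres for genre in all_genres]
        for title, genres in movie_genres.items()
    }


# Hypothetical example data, not taken from the actual dataset.
all_genres = ["Comedy", "Drama", "Sci-Fi"]
movie_genres = {
    "Movie A": {"Drama"},
    "Movie B": {"Comedy", "Sci-Fi"},
}
targets = genre_indicators(movie_genres, all_genres)
```

Each row of `targets` is the multi-label target for one movie, so `targets["Movie B"]` is `[True, False, True]`.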

# Unbalanced Data: Some Genres Are Overrepresented

Another issue to address is imbalanced genre frequencies. For example, from our exploratory data analysis, we saw that Drama is about 5 times as common as Sci-Fi. We would not want our classifier to be heavily biased towards Drama, so we considered the following adjustments:
1. Randomly undersampling the majority classes to have the same sample size as the minority class.
2. Randomly oversampling the minority classes to have the same sample size as the majority class. This duplicates minority class examples, which seems less ideal than having distinct examples throughout the dataset. We believe that method 1 is preferable: we will already be dealing with a relatively large dataset, so duplicating minority class examples to produce an even larger one seems unnecessary. With undersampling, all samples in the training dataset are unique.
3. Building cost-sensitive classifiers that take into account the imbalanced nature of the data when making predictions. We can adjust the prediction thresholds upwards or downwards from 0.5. While this method makes sense statistically, it will require adjustment of every model we build. Given the scope, timeline, and collaborative nature of the project, it may slow things down significantly.
4. Guided undersampling of the majority classes: remove majority samples that are very similar to one another. A problem with this is that we don’t have an appropriate method of measuring distance between movies.
5. Synthetic oversampling of the minority classes: create artificial examples of the minority class. With a large dataset already available, this seems unnecessary.

For the reasons explained above, we choose method 1: undersampling the majority classes.
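
A minimal sketch of random undersampling is shown below. It assumes each sample carries a single class label (for instance, one binary problem per genre); how to apply it when a movie has several genres at once is left open here. The function name `undersample` and the fixed seed are hypothetical choices.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    n_min = min(len(group) for group in by_label.values())

    balanced = []
    for label, group in by_label.items():
        # Sampling without replacement keeps every retained sample unique.
        for sample in rng.sample(group, n_min):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 10 Drama examples and 2 Sci-Fi examples.
toy_samples = list(range(12))
toy_labels = ["Drama"] * 10 + ["Sci-Fi"] * 2
balanced = undersample(toy_samples, toy_labels)
```

On the toy data, `balanced` contains two Drama and two Sci-Fi examples, with no sample repeated.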