# Movie Curiosities: Data Wrangling

This project can be considered the 'back-end' of my project *Movie curiosities: an IMDB exploration.* The `imdb_final.csv` dataset used in that project was created using different datasets that required a lot of cleaning and tinkering around. Here we'll go through the entire process. (If you want to read the exploration part instead, it's available both [with code](https://github.com/NicolaBagala/portfolio/blob/master/data/imdb/imdb_exploration.ipynb) and [code-free](https://github.com/NicolaBagala/portfolio/blob/master/data/imdb/codefree/imdb_exploration_codefree.ipynb).)

The datasets we'll use are:

- `imdb_movies.csv` contains data about 85,855 movies listed on IMDB, such as title, year of release, average score, budget, etc. I originally downloaded this from Kaggle, but since then, the [source](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset) is no longer available for some reason.
- `imdb_ratings.csv` contains rating data about the same 85,855 movies listed in the dataset above, such as total votes, mean scores, votes by age and sex, etc. (As above, the original Kaggle [source](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset) is no longer available.)
- `hist_CPIs.csv` contains historical values of the US consumer price index (CPI) from 1913 to 2021, used to adjust USD amounts for inflation. I created this dataset from [this table](https://www.minneapolisfed.org/about-us/monetary-policy/inflation-calculator/consumer-price-index-1913-).

Since `imdb_movies.csv` and `imdb_ratings.csv` are the two main data sources, this project consists of two wrangling and cleaning phases, one for each, after which they'll be merged.

## First dataset: `imdb_movies`

In [1]:
import pandas as pd
imdb = pd.read_csv("imdb_movies.csv", header = 0)                                                              

  exec(code_obj, self.user_global_ns, self.user_ns)


As we'll see in a moment, the warning above is due to mixed value types in the `year` column; some values are numbers, others are strings. We'll fix it later; for now, let's explore the dataset a bit more.

In [2]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_title_id          85855 non-null  object 
 1   title                  85855 non-null  object 
 2   original_title         85855 non-null  object 
 3   year                   85855 non-null  object 
 4   date_published         85855 non-null  object 
 5   genre                  85855 non-null  object 
 6   duration               85855 non-null  int64  
 7   country                85791 non-null  object 
 8   language               85022 non-null  object 
 9   director               85768 non-null  object 
 10  writer                 84283 non-null  object 
 11  production_company     81400 non-null  object 
 12  actors                 85786 non-null  object 
 13  description            83740 non-null  object 
 14  avg_vote               85855 non-null  float64
 15  vo

### Dropping unnecessary columns and renaming others

Many of the columns aren't needed for the purposes of the main project, so we can drop them to make the dataset more manageable. Namely:
- We can drop `title` and keep `original_title` instead.
- `date_published` wouldn't be very useful, because every movie has several different publishing dates, depending on the country where it is released. This column only shows one such date; in contrast, according to [IMDB's FAQs](https://help.imdb.com/article/contribution/titles/release-dates/GVUUDEPJNAW6G35P#), `year` refers to the year of the movie's first release *ever*, so we can keep this and drop `date_published`.
- `language`, `director`, `writer`, `production_company`, `actors`,  `description`, `reviews_from_users`, and `reviews_from_critics` aren't relevant for any of the questions I set out to answer in the main project, so off they go.
- `usa_gross_income` is in general less informative than `worlwide_gross_income` and contains many more null values, so let's drop it and keep the latter instead.
- To keep things simple, we'll work only with IMDB scores, so `metascore` can be dropped as well. 


However, before dropping anything, let's make a backup copy, in case we ever need to look at the entire dataset later on.

In [3]:
# Back up the dataset before dropping columns.
imdb_backup = imdb.copy()

# Drop unnecessary columns.
cols_to_drop = (["title", "date_published", "language", "director",
                 "writer", "production_company", "actors", "description", "usa_gross_income", 
                 "metascore", "reviews_from_users", "reviews_from_critics"])
imdb.drop(columns = cols_to_drop, inplace = True)
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_title_id          85855 non-null  object 
 1   original_title         85855 non-null  object 
 2   year                   85855 non-null  object 
 3   genre                  85855 non-null  object 
 4   duration               85855 non-null  int64  
 5   country                85791 non-null  object 
 6   avg_vote               85855 non-null  float64
 7   votes                  85855 non-null  int64  
 8   budget                 23710 non-null  object 
 9   worlwide_gross_income  31016 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 6.6+ MB


The names of some of the columns are too long or not sufficiently informative, so let's change them to something better. Also, let's capitalise them to more easily distinguish them from the rest.

In [4]:
new_names = {"imdb_title_id": "ID",
             "original_title": "TITLE",
             "year": "RELEASE_YEAR", 
             "duration":"LENGTH_MIN", # _MIN is just to be sure I remember how it's measured! 
             "avg_vote": "AVG_SCORE", # See below about this change
             "worlwide_gross_income": "GLOBAL_GROSS"}

imdb.rename(columns = new_names, inplace = True)
imdb.columns = imdb.columns.str.upper() # Capitalise everything, because I didn't manually change every name
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            85855 non-null  object 
 1   TITLE         85855 non-null  object 
 2   RELEASE_YEAR  85855 non-null  object 
 3   GENRE         85855 non-null  object 
 4   LENGTH_MIN    85855 non-null  int64  
 5   COUNTRY       85791 non-null  object 
 6   AVG_SCORE     85855 non-null  float64
 7   VOTES         85855 non-null  int64  
 8   BUDGET        23710 non-null  object 
 9   GLOBAL_GROSS  31016 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 6.6+ MB


Note that I renamed the column `avg_vote` to `AVG_SCORE`, to avoid confusion later on when handling the `imdb_ratings` dataset. On IMDB, every movie has an average score from 1 to 10, based on how users scored it. For example, say that ten users give a certain movie a score of 5, two give it an 8, and three give it a 9; in total, the movie has received 15 *votes*, which the IMDB backend then uses to compute an average *score* for the movie. Both the `imdb` dataset and the `imdb_ratings` dataset conflate the concepts of "score" and "user vote", which I find confusing, hence the renaming.
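
To make the vote/score distinction concrete, the numbers from the example can be worked through as a plain arithmetic mean. This is only a sketch: the variable names are made up for illustration, and the averaging IMDB actually applies may well be weighted differently.

```python
# Hypothetical tally from the example: score -> number of users who gave it.
votes_per_score = {5: 10, 8: 2, 9: 3}

# Each user contributes one vote.
total_votes = sum(votes_per_score.values())

# Plain arithmetic mean (an assumption; not necessarily IMDB's formula).
average_score = sum(score * count for score, count in votes_per_score.items()) / total_votes

print(total_votes, round(average_score, 1))
```

So 15 is a count of *votes*, while the mean (about 6.2 here) is the *score*.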

### Handling problematic values

In this section, we'll handle missing values and any other value that causes or might cause trouble later on.

#### Release years

`RELEASE_YEAR` triggered a warning as soon as the dataset was loaded. As mentioned, that's because it contains a mix of integers and strings; one value in particular definitely can't be cast to a numeric type.

In [5]:
imdb["RELEASE_YEAR"].unique()

array([1894, 1906, 1911, 1912, 1919, 1913, 1914, 1915, 1916, 1917, 1918,
       1920, 1921, 1924, 1922, 1923, 1925, 1926, 1935, 1927, 1928, 1983,
       1929, 1930, 1932, 1931, 1937, 1938, 1933, 1934, 1936, 1940, 1939,
       1942, 1943, 1941, 1948, 1944, 2001, 1946, 1945, 1947, 1973, 1949,
       1950, 1952, 1951, 1962, 1953, 1954, 1955, 1961, 1956, 1958, 1957,
       1959, 1960, 1963, 1965, 1971, 1964, 1966, 1968, 1967, 1969, 1976,
       1970, 1979, 1972, 1981, 1978, 2000, 1989, 1975, 1974, 1986, 1990,
       2018, 1977, 1982, 1980, 1993, 1984, 1985, 1988, 1987, 2005, 1991,
       2002, 1994, 1992, 1995, 2017, 1997, 1996, 2006, 1999, 1998, 2007,
       2008, 2003, 2004, 2010, 2009, 2011, 2013, 2012, 2016, 2015, 2014,
       2019, 2020, '2012', '2015', '2009', '2013', '2018', '2014', '2017',
       '2011', '2016', '1981', '1975', '2010', '1984', '2007', '2006',
       '2001', '2004', '1979', '2019', '1967', '1978', '2003', '2005',
       '1969', '1990', '1983', '2002', '1996', '2008'

The uncastable culprit is "TV Movie 2019", at the very bottom of the output above.

In [6]:
imdb.query("RELEASE_YEAR == 'TV Movie 2019'")

Unnamed: 0,ID,TITLE,RELEASE_YEAR,GENRE,LENGTH_MIN,COUNTRY,AVG_SCORE,VOTES,BUDGET,GLOBAL_GROSS
83917,tt8206668,Bad Education,TV Movie 2019,"Biography, Comedy, Crime",108,USA,7.1,23973,,


The problematic value affects only one entry, [*Bad Education*](https://en.wikipedia.org/wiki/Bad_Education_(2019_film)), which was indeed released in 2019 as a TV movie. The fact that it's a TV movie isn't relevant for the scope of the analysis, so let's just correct it to `2019`, and then convert the whole column to `dtype: int`.

In [7]:
imdb.at[83917, "RELEASE_YEAR"] = 2019
imdb["RELEASE_YEAR"] = imdb["RELEASE_YEAR"].astype(int)
imdb["RELEASE_YEAR"].unique()

array([1894, 1906, 1911, 1912, 1919, 1913, 1914, 1915, 1916, 1917, 1918,
       1920, 1921, 1924, 1922, 1923, 1925, 1926, 1935, 1927, 1928, 1983,
       1929, 1930, 1932, 1931, 1937, 1938, 1933, 1934, 1936, 1940, 1939,
       1942, 1943, 1941, 1948, 1944, 2001, 1946, 1945, 1947, 1973, 1949,
       1950, 1952, 1951, 1962, 1953, 1954, 1955, 1961, 1956, 1958, 1957,
       1959, 1960, 1963, 1965, 1971, 1964, 1966, 1968, 1967, 1969, 1976,
       1970, 1979, 1972, 1981, 1978, 2000, 1989, 1975, 1974, 1986, 1990,
       2018, 1977, 1982, 1980, 1993, 1984, 1985, 1988, 1987, 2005, 1991,
       2002, 1994, 1992, 1995, 2017, 1997, 1996, 2006, 1999, 1998, 2007,
       2008, 2003, 2004, 2010, 2009, 2011, 2013, 2012, 2016, 2015, 2014,
       2019, 2020])
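
Spotting the uncastable value by eye worked here because there was only one. As a hedged alternative, the sketch below finds such values programmatically with `pd.to_numeric(..., errors="coerce")`; the `years` series is a toy stand-in with assumed values, not the real column.

```python
import pandas as pd

# Toy stand-in for the raw `year` column (assumed values, not the real data).
years = pd.Series([1994, "2012", 2001, "TV Movie 2019"], dtype=object)

# errors="coerce" turns anything that can't be parsed as a number into NaN.
parsed = pd.to_numeric(years, errors="coerce")

# The original values at the NaN positions are the ones that need manual fixing.
uncastable = years[parsed.isna()]
print(uncastable.tolist())
```

The index of `uncastable` gives the row labels to correct, much like `83917` in the cell above.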

#### Lengths

In order to make sure there aren't any absurd movie lengths, let's check the minimum and maximum values of `LENGTH_MIN`:

In [8]:
print("Min: {}; Max: {}".format(imdb["LENGTH_MIN"].min(),imdb["LENGTH_MIN"].max()))

Min: 41; Max: 808


41 minutes is not at all a weird length, but 808 definitely is. What is it?

In [9]:
imdb.loc[imdb["LENGTH_MIN"] == 808, "TITLE"]

85057    La flor
Name: TITLE, dtype: object

Apparently, a movie that long [actually exists](https://en.wikipedia.org/wiki/La_Flor). Given that, there's no reason to doubt other lengths, because they're all shorter.

#### Possible misspellings

The only text columns are `TITLE`, `GENRE`, and `COUNTRY`. Any misspellings in the `TITLE` column don't really matter, as we'll keep titles only for reference; however, errors in the other two columns might be problematic, for example if the same country had multiple (mis)spellings, such as "USA" and "US", or if the same genre was spelled "Drama" in one place and "Drma" in another. To see whether there's any issue like that, let's print out every unique genre and country and inspect them visually. The two utility functions below will help us do that.

In [10]:
def string_to_list(string_list):
    """Turns a string-list such as "A, B, C" into an actual list object, ["A", "B", "C"]
    `string_list` is a string to be turned to an actual list.
    """    
    # If string_list doesn't contain a comma, it's a single entry.
    if "," not in string_list: return [string_list]        
    
    # If string_list does contain a comma, it's a string-list of entries.
    string_list = string_list.split(",")
    actual_list = []
    for item in string_list:
        actual_list.append(item.strip())
    return actual_list
    
def add_to_set(orig_list, dest_set):
    """Adds every item of a list to a set, to guarantee uniqueness.
    `orig_list` is a list (or iterable).
    `dest_set` is the destination set to which items of `orig_list` must be added.
    """
    for item in orig_list:
        dest_set.add(item)

As we've seen above, `GENRE` doesn't have any nulls, nor any particularly problematic values:

In [11]:
# Find unique genres across the dataset.
unique_genre_strings = pd.Series(imdb["GENRE"].unique()) 
unique_genre_lists = unique_genre_strings.map(string_to_list)

unique_genres = set()
unique_genre_lists.apply(add_to_set, args = (unique_genres,)) 
unique_genres

{'Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}
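
For comparison, the same set of unique items can probably be obtained without helper functions, using pandas string methods and `explode`. This is a sketch on a toy series with assumed values, not a replacement for the cells above.

```python
import pandas as pd

# Toy stand-in for the `GENRE` column (assumed values).
genres = pd.Series(["Romance", "Biography, Crime, Drama", "Drama, History"])

# Split each string-list on commas, give every item its own row, then trim whitespace.
unique_genres = set(genres.str.split(",").explode().str.strip())
print(sorted(unique_genres))
```

Like `string_to_list`, this would fail on null entries unless they are filled or dropped first.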

I don't like the hyphenation of a couple of genres, so I'll change that.

In [12]:
imdb["GENRE"] = imdb["GENRE"].map(lambda x: x.replace("Film-Noir", "Film noir"))
imdb["GENRE"] = imdb["GENRE"].map(lambda x: x.replace("Reality-TV", "Reality TV"))

Before we can move on to look for spelling issues with the country names, we have to fix the missing values in the `COUNTRY` column, which if left alone would break the function `string_to_list`. (`GENRE` didn't have any nulls, which is why we didn't have to worry about that before.)

In [13]:
print("Missing COUNTRY values: {}".format(imdb["COUNTRY"].isna().sum()))

Missing COUNTRY values: 64


Seeing as there are "only" 64 movies with a missing value for `COUNTRY`, it's tempting to look them up and fill in the country by hand; I tried that, but they turned out to be mostly obscure indie movies for which finding the country would be a time sink. It's best to mark them as `(Missing)` instead and move on with the misspelling check.

In [14]:
imdb["COUNTRY"] = imdb["COUNTRY"].fillna("(Missing)")

unique_country_strings = pd.Series(imdb["COUNTRY"].unique())
unique_country_lists = unique_country_strings.map(string_to_list)

unique_countries = set()
unique_country_lists.apply(add_to_set, args = (unique_countries,)) 
unique_countries

{'(Missing)',
 'Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Belarus',
 'Belgium',
 'Belize',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Czechoslovakia',
 "Côte d'Ivoire",
 'Denmark',
 'Djibouti',
 'Dominican Republic',
 'East Germany',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Estonia',
 'Ethiopia',
 'Faroe Islands',
 'Federal Republic of Yugoslavia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Guadeloupe',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',

There are only two names that could be shortened. No other obvious issues.

In [15]:
imdb["COUNTRY"] = imdb["COUNTRY"].map(lambda x: x.replace("Holy See (Vatican City State)", "Vatican City"))
imdb["COUNTRY"] = imdb["COUNTRY"].map(lambda x: x.replace("The Democratic Republic Of Congo", "Democratic Republic Of Congo"))

#### Redundancies

The columns `GENRE` and `COUNTRY` might contain multiple values for the same entry, because a movie can have several genres and be the product of a multi-country collaboration. For example:

In [16]:
imdb["GENRE"].head()

0                      Romance
1      Biography, Crime, Drama
2                        Drama
3               Drama, History
4    Adventure, Drama, Fantasy
Name: GENRE, dtype: object

In [17]:
imdb["COUNTRY"].head()

0                 USA
1           Australia
2    Germany, Denmark
3                 USA
4               Italy
Name: COUNTRY, dtype: object

The problem with that is that two such lists could be conceptually identical while being technically different; for example, `Comedy, Drama` and `Drama, Comedy` identify the same two genres, but they are two different strings. There are cases like this in both the `GENRE` and `COUNTRY` columns.

In [18]:
comedy_drama = imdb.loc[imdb["GENRE"] == "Comedy, Drama", "GENRE"].size
drama_comedy = imdb.loc[imdb["GENRE"] == "Drama, Comedy", "GENRE"].size

print("'Comedy, Drama':", comedy_drama)
print("'Drama, Comedy':", drama_comedy)

'Comedy, Drama': 4039
'Drama, Comedy': 124


In [19]:
ger_den = imdb.loc[imdb["COUNTRY"] == "Germany, Denmark", "COUNTRY"].size
den_ger = imdb.loc[imdb["COUNTRY"] == "Denmark, Germany", "COUNTRY"].size

print("'Germany, Denmark':", ger_den)
print("'Denmark, Germany':", den_ger)

'Germany, Denmark': 7
'Denmark, Germany': 10


We can fix that by sorting each list alphabetically, using the utility function below.

In [20]:
def alphasort_string(string):
    """Takes in a string of the form "item1, item2, item3, ..." and returns a string
    where the item1, item2, item3, ..., have been sorted alphabetically.
    
    `string` is a string whose comma-separated entries need to be sorted alphabetically.
    """
    items = [item.strip() for item in string.split(",")]
    # Join the sorted items back with ", ". Going through str(list) and stripping quotes
    # would also delete apostrophes inside names such as "Côte d'Ivoire".
    return ", ".join(sorted(items))

# Sort all lists in the `GENRE` and `COUNTRY` columns alphabetically. 
imdb["GENRE"] = imdb["GENRE"].map(alphasort_string)
imdb["COUNTRY"] = imdb["COUNTRY"].map(alphasort_string)

To make sure it worked, let's check how many entries whose `GENRE` is `Comedy, Drama` or `Drama, Comedy` there are now. If everything worked as expected, all the 124 `Drama, Comedy` entries we had before should now be `Comedy, Drama` entries, whose total should be `4039 + 124 = 4163`. The total of `Drama, Comedy` entries should instead be zero.

In [21]:
comedy_drama = imdb.loc[imdb["GENRE"] == "Comedy, Drama", "GENRE"].size
drama_comedy = imdb.loc[imdb["GENRE"] == "Drama, Comedy", "GENRE"].size

print("'Comedy, Drama':", comedy_drama)
print("'Drama, Comedy':", drama_comedy)

'Comedy, Drama': 4163
'Drama, Comedy': 0
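
As a compact sketch of the same normalisation, the sorting can be built on `sorted` and `str.join`, which leave characters such as apostrophes inside names untouched. `normalise_list_string` is a hypothetical name used only for this illustration.

```python
def normalise_list_string(string):
    """Hypothetical helper: sort the comma-separated items of `string` alphabetically."""
    items = [item.strip() for item in string.split(",")]
    return ", ".join(sorted(items))

print(normalise_list_string("Drama, Comedy"))
print(normalise_list_string("France, Côte d'Ivoire"))
```

Both orderings of a pair collapse to one string, which is what makes the counts above add up.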


#### Duplicates

Before doing any further work, it's best to make sure there are no duplicate entries. `ID`s should be different no matter what, so let's make sure that's the case.

In [22]:
imdb["ID"].duplicated().value_counts()

False    85855
Name: ID, dtype: int64

There are no duplicated IDs, but that doesn't mean there can't be duplicated *movies*. Two different movies can easily share a title, so checking titles alone won't cut it. It's also possible, however unlikely, for two different movies to have the same `AVG_SCORE` or `VOTES`, and `BUDGET` and `GLOBAL_GROSS` have far too many nulls to be useful for spotting duplicates. Our best bet is therefore to look for movies that share the same `TITLE`, `RELEASE_YEAR`, `GENRE`, `LENGTH_MIN`, and `COUNTRY`. To avoid missing duplicates whose titles differ only in letter case or surrounding whitespace, we'll first make all titles lowercase and strip them.

In [23]:
imdb_duplicate_check = imdb[["TITLE", "RELEASE_YEAR", "GENRE", "LENGTH_MIN", "COUNTRY"]].copy()

# Format all titles uniformly.
imdb_duplicate_check["TITLE"] = imdb_duplicate_check["TITLE"].map(str.lower).map(str.strip)

# Check for duplicates over all fields. Find and report each and every instance of a potential duplicate `(keep = False)`.
duplicates = imdb_duplicate_check.duplicated(keep = False)
duplicates.value_counts()

False    85853
True         2
dtype: int64

There are two potential duplicates; let's look them up in the original dataset through the index.

In [24]:
duplicate_ids = duplicates[duplicates].index
imdb.loc[duplicate_ids, :]

Unnamed: 0,ID,TITLE,RELEASE_YEAR,GENRE,LENGTH_MIN,COUNTRY,AVG_SCORE,VOTES,BUDGET,GLOBAL_GROSS
64199,tt2069797,Delirium,2018,"Horror, Thriller",96,USA,5.7,6114,,
69971,tt3131050,Delirium,2018,"Horror, Thriller",96,USA,3.2,739,$ 3000000,


Interestingly, there actually *are* two different movies from 2018 called *Delirium*, both from the USA, both 96 minutes long, and both with the same genres. ([Here](https://www.imdb.com/title/tt2069797/?ref_=fn_tt_tt_1) and [here](https://www.imdb.com/title/tt3131050/?ref_=fn_tt_tt_8) on IMDB.) Still, to be absolutely sure that the two rows above really do refer to different movies, we can look them up in the backup copy of the IMDB dataset, which contains more columns.

In [25]:
imdb_backup.loc[duplicate_ids, :]

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
64199,tt2069797,Delirium,Delirium,2018,2018-05-22,"Horror, Thriller",96,USA,English,Dennis Iliadis,...,"Genesis Rodriguez, Topher Grace, Patricia Clar...",A man recently released from a mental institut...,5.7,6114,,,,,102.0,23.0
69971,tt3131050,Delirium,Delirium,2018,2018-01-19,"Horror, Thriller",96,USA,English,Johnny Martin,...,"Mike C. Manning, Griffin Freeman, Ryan Pinksto...",A group of young men dare a classmate to reach...,3.2,739,$ 3000000,,,27.0,19.0,6.0


The two entries have different publishing dates, directors, and actors, so at this point we're sure that they're two different movies and that there shouldn't be any duplicates in the dataset.
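
The check above works on a copy restricted to the key columns. A roughly equivalent sketch passes those columns to `duplicated` through its `subset` parameter instead. The toy frame below reuses the two *Delirium* rows plus one made-up row; `tt_hypothetical` is not a real ID.

```python
import pandas as pd

# Toy frame: the two *Delirium* rows from above, plus one made-up row.
movies = pd.DataFrame({
    "ID": ["tt2069797", "tt3131050", "tt_hypothetical"],
    "TITLE": ["Delirium", "Delirium", "Some Other Movie"],
    "RELEASE_YEAR": [2018, 2018, 2018],
    "LENGTH_MIN": [96, 96, 101],
})

# Flag every row that shares all the listed fields with at least one other row.
key_columns = ["TITLE", "RELEASE_YEAR", "LENGTH_MIN"]
flagged = movies.duplicated(subset=key_columns, keep=False)
print(movies.loc[flagged, "ID"].tolist())
```

Flagged rows are only *candidate* duplicates; as the *Delirium* case shows, they still need a manual look.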

### Handling currency columns

Some of the questions I set out to answer require converting all amounts in the `BUDGET` and `GLOBAL_GROSS` columns into today's (2021, at the time of writing), inflation-adjusted USD (United States Dollars). We're operating on the (reasonable) assumption that the values in the `BUDGET` and `GLOBAL_GROSS` columns are relative to the year the movie was released. So, if a movie was released in 1950 and had a budget of, say, USD 50,000 and grossed USD 200,000 globally, we'd be talking 1950's USD, and similarly for any other currency. 

Adjusting for inflation from year $y$ to today isn't difficult; as explained [here](https://www.officialdata.org/us/inflation/1800?amount=1#formulas), it's enough to multiply each amount by 

$$
r^{t}_{y}=\frac{CPI_t}{CPI_y} 
$$

where ${CPI}_t$ is today's [*consumer price index*](https://en.wikipedia.org/wiki/Consumer_price_index) and ${CPI}_y$ is the CPI from a past year $y$. In general, the value of $r^{t}_{y}$ is different for different currencies, because the rate of inflation experienced by different countries is different; so, we might need several different historical CPI records, depending on how many different currencies appeared in the `BUDGET` and `GLOBAL_GROSS` columns. (Converting every amount to a single, "bridge" currency, adjusting those amounts for inflation and converting back to the original currency doesn't work, precisely because inflation grows differently in different countries.)
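
As a small worked instance of the ratio, the sketch below converts an amount between two years using CPI figures that appear in the `hist_CPIs.csv` file loaded later in this notebook (9.9 for 1913 and 12.8 for 1917). Here 1917 stands in for "today" purely to stay with those figures; `adjust` and the amount of 1,000 are hypothetical.

```python
# CPI values for 1913 and 1917 as listed in `hist_CPIs.csv`.
cpi = {1913: 9.9, 1917: 12.8}

def adjust(amount, from_year, to_year):
    """Hypothetical helper: scale `amount` by CPI_to / CPI_from."""
    return amount * cpi[to_year] / cpi[from_year]

# 1,000 USD of 1913 expressed in 1917 USD (made-up amount).
adjusted = adjust(1000, 1913, 1917)
print(round(adjusted, 2))
```

The ratio 12.8 / 9.9 is roughly 1.29, so the 1913 amount grows by about 29% when expressed in 1917 dollars.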

So, the next question is: how many currencies are there, and what format are they expressed in?

#### Currency format

The `dtype` of both the `BUDGET` and `GLOBAL_GROSS` columns is `object`, and their non-null values appear to be composed of a currency symbol followed by a single space and then a figure. Most of the currency symbols are `$` signs, but the rest are three-letter strings, like `EUR` or `NOK`. Let's make sure this is the case.

In [26]:
# Regex: match strings that begin EITHER with a single $ OR with exactly three characters that are not digits, whitespace, or $, followed by a single space and one or more digits.
pattern = r"(^[$]|^[^0-9\s$]{3}) [0-9]+"

# Consider nulls as True, so that any False is a value that is NOT null AND does not match the pattern.
budget_matches = imdb["BUDGET"].str.fullmatch(pattern, na = True).value_counts()[True]
print("Budget values that match the pattern OR are null: {} out of {}".format(budget_matches, len(imdb)))      
      
gross_matches = imdb["GLOBAL_GROSS"].str.fullmatch(pattern, na = True).value_counts()[True]
print("Gross values that match the pattern OR are null: {} out of {}".format(gross_matches, len(imdb)))

Budget values that match the pattern OR are null: 85855 out of 85855
Gross values that match the pattern OR are null: 85855 out of 85855
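
To see what the pattern accepts and rejects, it can be tried on a few hand-made strings with the standard `re` module. The sample strings are invented for illustration. Note that the second alternative accepts any three characters that aren't digits, whitespace, or `$`, not strictly letters.

```python
import re

# Same pattern as in the cell above.
pattern = r"(^[$]|^[^0-9\s$]{3}) [0-9]+"

# Invented samples: two well-formed values and three malformed ones.
samples = ["$ 3000000", "EUR 500", "$3000000", "EURO 500", "### 42"]

results = {s: re.fullmatch(pattern, s) is not None for s in samples}
for sample, matched in results.items():
    print(repr(sample), matched)
```

The last sample matches even though `###` isn't a currency code, which is why the unique currency symbols are still worth listing explicitly.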


As a regular-expression metacharacter, the `$` sign might be problematic in later pattern matching, so it's best to replace it with the string `USD`. (This also makes our dataset more consistent.)

In [27]:
imdb["BUDGET"] = imdb["BUDGET"].str.replace(r"$", "USD", regex = False)
imdb["GLOBAL_GROSS"] = imdb["GLOBAL_GROSS"].str.replace(r"$", "USD", regex = False)
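
The `regex = False` argument matters here. The sketch below uses the standard `re` module on an invented value to show what happens when `$` is interpreted as a regex rather than as a literal character.

```python
import re

value = "$ 3000000"

# As a regex, "$" is an end-of-string anchor, so "USD" is appended instead of replacing the sign.
as_regex = re.sub(r"$", "USD", value)

# As a literal string, "$" is just a character to be replaced.
as_literal = value.replace("$", "USD")

print(as_regex)
print(as_literal)
```

Only the literal replacement yields the `USD 3000000` form that the rest of the notebook relies on.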

Now we can look at both columns to see which currencies we're dealing with.

In [28]:
# Regex: extract anything that is NOT a number or a whitespace. (So, currency symbols.)
currency_pattern = r"([^0-9\s]+)"

# This function looks too simple to bother with, but we'll use it again.
def print_all_currencies():
    """Prints all unique currencies across both `BUDGET` and `GLOBAL_GROSS` columns. """    
    budget_currencies = imdb["BUDGET"].str.extract(currency_pattern, expand = False)
    gross_currencies = imdb["GLOBAL_GROSS"].str.extract(currency_pattern, expand = False)
    all_currencies = pd.concat([budget_currencies, gross_currencies], axis = 0).unique()
    print(all_currencies)
    
print_all_currencies()

[nan 'USD' 'ITL' 'ROL' 'SEK' 'FRF' 'NOK' 'GBP' 'DEM' 'PTE' 'FIM' 'CAD'
 'INR' 'CHF' 'ESP' 'JPY' 'DKK' 'NLG' 'PLN' 'RUR' 'AUD' 'KRW' 'BEF' 'XAU'
 'HKD' 'NZD' 'CNY' 'EUR' 'PYG' 'ISK' 'IEP' 'TRL' 'HRK' 'SIT' 'PHP' 'HUF'
 'DOP' 'JMD' 'CZK' 'SGD' 'BRL' 'BDT' 'ATS' 'BND' 'EGP' 'THB' 'GRD' 'ZAR'
 'NPR' 'IDR' 'PKR' 'MXN' 'BGL' 'EEK' 'YUM' 'MYR' 'IRR' 'CLP' 'SKK' 'LTL'
 'TWD' 'MTL' 'LVL' 'COP' 'ARS' 'UAH' 'RON' 'ALL' 'NGN' 'ILS' 'VEB' 'VND'
 'TTD' 'JOD' 'LKR' 'GEL' 'MNT' 'AZM' 'AMD' 'AED']


Finding historical CPI records for so many different currencies would be a bit of a nightmare, so we should consider the possibility of limiting the analysis to `USD` values only, which are likely the most frequent ones. Let's see.

In [29]:
for col in ["BUDGET", "GLOBAL_GROSS"]:
    non_nulls = imdb[col].notnull()
    nr_non_nulls = non_nulls.sum()
    nr_non_USD = (imdb.loc[non_nulls, col].str.startswith("USD") == False).sum()
    non_USD_pct = round(100 * nr_non_USD / nr_non_nulls, 1)
    print("Non-USD {} values: {} out of {} non-null values. ({}%)".format(col, nr_non_USD, nr_non_nulls, non_USD_pct))

Non-USD BUDGET values: 7108 out of 23710 non-null values. (30.0%)
Non-USD GLOBAL_GROSS values: 61 out of 31016 non-null values. (0.2%)


The number of non-USD global grosses is insignificant compared to the total number of non-null global grosses. This isn't quite the case for non-USD budgets, which constitute 30% of all the available budgets, but I think it's better to give up on that 30% than to go on a wild goose chase looking for all sorts of ancient CPI records. Let's set all non-USD budgets and global grosses to `nan`.

In [30]:
from numpy import nan

for col in ["BUDGET", "GLOBAL_GROSS"]:
    nulls_before = imdb[col].isna().sum()
    non_USD = imdb[col].str.startswith("USD") == False
    non_USD_count = non_USD.sum()
    imdb.loc[non_USD, col] = nan    
    nulls_after = imdb[col].isna().sum()

    print("Null {} values increased by: {} (Should be {})".format(col, nulls_after - nulls_before, non_USD_count))

Null BUDGET values increased by: 7108 (Should be 7108)
Null GLOBAL_GROSS values increased by: 61 (Should be 61)


At this point, the only currency in both `BUDGET` and `GLOBAL_GROSS` should be `USD` (or `nan`, in case of null values).

In [31]:
print_all_currencies()

[nan 'USD']
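
The same nulling-out could plausibly be written without an explicit mask assignment, using `Series.where` together with the `na` argument of `str.startswith`. The toy series below has assumed values and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the `BUDGET` column (assumed values).
budget = pd.Series(["USD 3000000", "EUR 500000", None])

# na=False treats missing values as "does not start with USD";
# `where` keeps entries where the condition holds and nulls the rest.
usd_only = budget.where(budget.str.startswith("USD", na=False))
print(usd_only.tolist())
```

Entries that were already null stay null, so the count of newly nulled values still has to be computed separately if it's needed as a sanity check.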


#### Currency consistency check

While it's perfectly normal for worldwide grosses to be expressed in USD, regardless of the movie's country of origin, that would be a little suspect for budgets. All the budgets we have left now were expressed in USD to begin with, so we can check whether the USA is at least among the countries of origin for each of the movies corresponding to these budgets.

In [32]:
country_currency_mismatch_count = imdb.loc[(imdb["BUDGET"].notnull()) & (imdb["COUNTRY"].str.contains("USA") == False)].shape[0]
country_currency_mismatch_count

3203

For the 3203 movies above, there's a chance that the reported budget is wrong, because the currency doesn't match the country of origin. 3203 budgets is a lot of data to throw away, but checking and correcting them manually would be a time sink. If they're wrong, they'd skew the analysis, so it's best to set these to `nan` too.

In [33]:
null_budgets_before = imdb["BUDGET"].isna().sum()
imdb.loc[(imdb["BUDGET"].isna() == False) & (imdb["COUNTRY"].str.contains("USA") == False), "BUDGET"] = nan
null_budgets_after = imdb["BUDGET"].isna().sum()

print("Null BUDGET values increased by: {} (Should be {})".format(null_budgets_after - null_budgets_before, country_currency_mismatch_count))

Null BUDGET values increased by: 3203 (Should be 3203)


At this point, the string `USD` may as well be removed from both the `BUDGET` and `GLOBAL_GROSS` columns. The columns themselves can be turned into `float` and renamed to `BUDGET_USD` and `GLOBAL_GROSS_USD` for clarity.

In [34]:
imdb["BUDGET"] = imdb["BUDGET"].str.replace(r"USD ", "", regex = False).astype(float)
imdb["GLOBAL_GROSS"] = imdb["GLOBAL_GROSS"].str.replace(r"USD ", "", regex = False).astype(float)

imdb.rename(columns = {"BUDGET":"BUDGET_USD","GLOBAL_GROSS":"GLOBAL_GROSS_USD"}, inplace = True)
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                85855 non-null  object 
 1   TITLE             85855 non-null  object 
 2   RELEASE_YEAR      85855 non-null  int32  
 3   GENRE             85855 non-null  object 
 4   LENGTH_MIN        85855 non-null  int64  
 5   COUNTRY           85855 non-null  object 
 6   AVG_SCORE         85855 non-null  float64
 7   VOTES             85855 non-null  int64  
 8   BUDGET_USD        13399 non-null  float64
 9   GLOBAL_GROSS_USD  30955 non-null  float64
dtypes: float64(3), int32(1), int64(2), object(4)
memory usage: 6.2+ MB


#### Currency conversion

At this point, all that's left is adjusting budgets and grosses for inflation. To do that, we only need historical CPI values for the USD, available in the `hist_CPIs.csv` file mentioned in the introduction. Unfortunately, historical CPIs before 1913 aren't available, so, for movies released before 1913, adjusting budget or gross values for inflation won't be possible.

In [35]:
print("Movies released before 1913: {}".format(len(imdb.query("RELEASE_YEAR < 1913"))))

Movies released before 1913: 12


That's not a lot of entries, so we can exclude them from currency analyses and replace their budgets and grosses with nulls.

In [36]:
imdb.loc[imdb["RELEASE_YEAR"] < 1913, ["BUDGET_USD", "GLOBAL_GROSS_USD"]] = nan

The rest of the budget and gross values can all be adjusted for inflation. To proceed with the conversion, we need to extract the historical CPI values and organise them in a convenient format. We'll use a dictionary, `hist_CPIs`, with `(year, CPI-of-year)` key-value pairs. To create it, we need a list of release years of movies for which the budget and/or global gross are available. We can then associate each of these years to its CPI. Let's first load the CSV with historical CPI values and have a look at it. It only has two columns, one of which is convenient to set as the `index`.

In [37]:
hist_CPIs_file = pd.read_csv("hist_CPIs.csv", header = 0, index_col = "YEAR")
hist_CPIs_file.head()

YEAR,USD_CPI
1913,9.9
1914,10.0
1915,10.1
1916,10.9
1917,12.8


Now we make a list of available release years, look them up in the CPI file, and use the values to populate our year-CPI dictionary.

In [38]:
# For each movie with a known budget or global gross, extract the release year.
non_na_budget_gross_mask = (imdb["BUDGET_USD"].notnull() | imdb["GLOBAL_GROSS_USD"].notnull())
release_years = list(imdb.loc[non_na_budget_gross_mask, "RELEASE_YEAR"].unique())

hist_CPIs = {year : hist_CPIs_file.at[year, "USD_CPI"] for year in release_years}

2021 wasn't among the available release years:

In [39]:
max(release_years)

2020

so let's add it manually, because the 2021 CPI is the reference value that every amount gets converted to.

In [40]:
hist_CPIs[2021] = hist_CPIs_file.at[2021, "USD_CPI"]
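
As an aside, the CPI table is small enough to turn into a dictionary in one go, which would make the separate 2021 step unnecessary. A minimal sketch, assuming a table shaped like `hist_CPIs_file`; the names `cpi_table` and `all_cpis` are hypothetical, and only the first three rows displayed above are reproduced:

```python
import pandas as pd

# Stand-in for hist_CPIs_file, using the first three rows displayed above.
cpi_table = pd.DataFrame(
    {"USD_CPI": [9.9, 10.0, 10.1]},
    index=pd.Index([1913, 1914, 1915], name="YEAR"),
)

# Series.to_dict() maps each index label (the year) to its value (the CPI).
all_cpis = cpi_table["USD_CPI"].to_dict()
print(all_cpis)
```

The trade-off is that the dictionary then holds every year in the file rather than only the release years that actually occur in the dataset.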

Now budgets and global grosses can be converted to 2021 US dollars.

In [41]:
def adjust_for_inflation(series):
    """Adjusts USD amounts for inflation, converting them to 2021 USD, rounded to nearest integer.
    `series` is a series with two rows; the first one is the (movie release) year we're converting from, 
    while the second one is the amount of money to be converted.
    """
    release_year = series.iloc[0] 
    amount = series.iloc[1] 
    
    USD2021 = round((hist_CPIs[2021] / hist_CPIs[release_year]) * amount)
    return USD2021

imdb["BUDGET_ADJUSTED"] = imdb.loc[imdb["BUDGET_USD"].notnull(), ["RELEASE_YEAR", "BUDGET_USD"]].apply(adjust_for_inflation, axis = 1)
imdb["GROSS_ADJUSTED"] = imdb.loc[imdb["GLOBAL_GROSS_USD"].notnull(), ["RELEASE_YEAR", "GLOBAL_GROSS_USD"]].apply(adjust_for_inflation, axis = 1)
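
The conversion boils down to multiplying each amount by the ratio between the 2021 CPI and the CPI of the release year. The same idea could also be written without a row-wise `apply`. The sketch below is only an illustration: the CPI values in `toy_cpis` are made up (they are not the real figures), and the `movies` frame is a hypothetical stand-in for `imdb`.

```python
import pandas as pd

# Made-up CPI values, for illustration only (NOT the real figures).
toy_cpis = pd.Series({2000: 100.0, 2010: 150.0, 2021: 200.0})

# Hypothetical stand-in for the imdb dataframe; the last budget is unknown.
movies = pd.DataFrame({
    "RELEASE_YEAR": [2000, 2010, 2021, 2010],
    "BUDGET_USD": [1_000_000.0, 3_000_000.0, 5_000_000.0, None],
})

# Look up each release year's CPI, then scale by the 2021 CPI.
factor = toy_cpis.loc[2021] / movies["RELEASE_YEAR"].map(toy_cpis)

# Null budgets stay null, so no mask is needed.
movies["BUDGET_ADJUSTED"] = (movies["BUDGET_USD"] * factor).round()
print(movies)
```

With these toy numbers, an amount from 2000 doubles, while an amount from 2021 is left unchanged.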

Now we only have to check that the adjusted values make sense before swapping them with the originals.

In [42]:
adjusted_check = imdb[["TITLE", "RELEASE_YEAR", "BUDGET_USD", "BUDGET_ADJUSTED", "GLOBAL_GROSS_USD", "GROSS_ADJUSTED"]]
adjusted_check.describe()

Unnamed: 0,RELEASE_YEAR,BUDGET_USD,BUDGET_ADJUSTED,GLOBAL_GROSS_USD,GROSS_ADJUSTED
count,85855.0,13397.0,13397.0,30955.0,30955.0
mean,1993.500891,16300550.0,26211040.0,22527860.0,33981480.0
std,24.21642,31411700.0,41329580.0,88819070.0,137293800.0
min,1894.0,0.0,0.0,1.0,3.0
25%,1979.0,662141.0,1576074.0,114952.0,158765.5
50%,2003.0,3346500.0,10316370.0,1108231.0,1483916.0
75%,2013.0,18000000.0,32425330.0,8299774.0,12010320.0
max,2020.0,356000000.0,392764100.0,2797801000.0,7856006000.0


The adjusted values are comparable in terms of order of magnitude, so there *shouldn't* be anything weird going on with the budgets, and similarly with the grosses. To double-check that, we can use [this IMDB list](https://www.imdb.com/list/ls026442468/) of highest-grossing blockbusters adjusted for inflation.

In [43]:
adjusted_check[["TITLE", "GLOBAL_GROSS_USD", "GROSS_ADJUSTED"]].sort_values("GROSS_ADJUSTED", axis = 0, ascending = False).head(10)

Unnamed: 0,TITLE,GLOBAL_GROSS_USD,GROSS_ADJUSTED
3266,Gone with the Wind,402352600.0,7856006000.0
4104,Bambi,267447200.0,4453077000.0
31086,Titanic,2195170000.0,3711957000.0
49415,Avatar,2790439000.0,3530653000.0
2827,Snow White and the Seven Dwarfs,184925500.0,3485332000.0
18216,Star Wars,775768900.0,3474318000.0
73865,Avengers: Endgame,2797801000.0,2969586000.0
16015,The Exorcist,441306100.0,2697534000.0
17068,Jaws,471961400.0,2380861000.0
67523,Star Wars: Episode VII - The Force Awakens,2068224000.0,2368422000.0


The highest adjusted gross belongs to *Gone with the Wind*. According to the very IMDB page I linked above, adjusting for inflation for such old movies is very hard. *Gone with the Wind* is in first place on their list too, but their inflation-adjusted gross is around 3.75 billion, far less than our 7.86 billion. However, that IMDB page shows two different values for the gross:

![](gww.png)

In the case of *Gone with the Wind*, the `Gross` value (highlighted in green) is 198.68 million USD. There's another suspect value, `Real Worldwide Box Office`, which we'll get to in a minute, but for now, let's focus on the `Gross` value. Using the inflation-adjusting formula we've used so far, this value becomes very close to the `Adjusted` value in the picture above:

In [44]:
gww_test = adjust_for_inflation(pd.Series([1939, 198680000]))
print(gww_test)

3879262734


That is, about 3.88 billion USD, while the `Adjusted` value in the screenshot is about 3.75. Other movies on that page show similar patterns, which suggests that the `Adjusted` amounts on the IMDB page are calculated from the `Gross` value, probably using a formula very similar to ours. The `GLOBAL_GROSS_USD` column in our dataset doesn't contain the same values as this `Gross` field on the page, but rather the `Real Worldwide Box Office` values; for example, in the case of *Gone with the Wind*, `GLOBAL_GROSS_USD` is

In [45]:
imdb.loc[3266, ["TITLE", "GLOBAL_GROSS_USD"]]

TITLE               Gone with the Wind
GLOBAL_GROSS_USD           402352579.0
Name: 3266, dtype: object

That is, about 402.4 million USD, as shown in the blue box above. Again, this is true of other movies on that page as well, and while it's hard to tell what the difference between `Gross` and `Real Worldwide Box Office` is, it's clear that our dataset contains the latter. It's also clear that our formula is accurate enough, as feeding it `Gross` values from the IMDB page yields values very close to the `Adjusted` values on the same page. For the scope of this project, this level of accuracy is sufficient, so we can replace our `BUDGET_USD` and `GLOBAL_GROSS_USD` columns with their adjusted counterparts.

In [46]:
# Get rid of the non-adjusted columns.
imdb.drop(columns = ["BUDGET_USD", "GLOBAL_GROSS_USD"], inplace = True)

# Rename the adjusted columns like the columns we just dropped.
imdb.rename(columns = {"BUDGET_ADJUSTED": "BUDGET_USD", "GROSS_ADJUSTED": "GLOBAL_GROSS_USD"}, inplace = True)
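
As a quick cross-check of the *Gone with the Wind* figures discussed above, both the test conversion and the dataset's own adjusted gross should imply the same 1939-to-2021 multiplier, since both went through the same formula. The sketch below only reuses numbers copied from the outputs shown earlier; the variable names are hypothetical, and the dataset's adjusted value is the rounded one displayed in the table.

```python
# Figures copied from the outputs above.
gross_page = 198_680_000            # `Gross` value fed to the test conversion
adjusted_page = 3_879_262_734       # result of that test conversion
gross_dataset = 402_352_579.0       # GLOBAL_GROSS_USD in our dataset
adjusted_dataset = 7_856_006_000.0  # GROSS_ADJUSTED as displayed (rounded)

# Each pair implies a 1939 -> 2021 multiplier.
factor_from_test = adjusted_page / gross_page
factor_from_dataset = adjusted_dataset / gross_dataset

print(round(factor_from_test, 3), round(factor_from_dataset, 3))
```

Both multipliers come out at roughly 19.5, which is what we'd expect if the gap with IMDB's list comes from the input amounts rather than from the formula.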

## Second dataset: `imdb_ratings`

A significant part of the main project is about movie scores, and for that we'll need data from the `imdb_ratings.csv` file.

In [47]:
imdb_ratings = pd.read_csv("imdb_ratings.csv", header = 0)
imdb_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 49 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   imdb_title_id              85855 non-null  object 
 1   weighted_average_vote      85855 non-null  float64
 2   total_votes                85855 non-null  int64  
 3   mean_vote                  85855 non-null  float64
 4   median_vote                85855 non-null  float64
 5   votes_10                   85855 non-null  int64  
 6   votes_9                    85855 non-null  int64  
 7   votes_8                    85855 non-null  int64  
 8   votes_7                    85855 non-null  int64  
 9   votes_6                    85855 non-null  int64  
 10  votes_5                    85855 non-null  int64  
 11  votes_4                    85855 non-null  int64  
 12  votes_3                    85855 non-null  int64  
 13  votes_2                    85855 non-null  int

Just to be picky, let's make extra-sure that the two datasets have the same IDs.

In [48]:
imdb_ratings["imdb_title_id"].equals(imdb["ID"])

True

That was expected, but great news nonetheless because it's the obvious column of choice to eventually merge the two datasets. Therefore, it makes sense to rename it too as `ID`. However, there'll be more renaming to do, so rather than doing it one column at a time, let's just keep track of what needs renaming to what, and then we'll call `rename` only once.

In [49]:
new_column_names = {"imdb_title_id": "ID"} 

#### Identifying and dropping unnecessary columns

There are columns that are likely duplicates: `total_votes`, which is probably the same as `VOTES` from the `imdb` dataset, and `weighted_average_vote`, which is probably the same as `AVG_SCORE` again from `imdb`. Before we delete them, let's make sure that they are indeed redundant.

In [50]:
# Check the votes columns.
imdb_votes = imdb["VOTES"]
total_votes = imdb_ratings["total_votes"]
print('Does VOTES contain the same data as total_votes? ', imdb_votes.equals(total_votes))

# Check the score columns.
avg_from_imdb = imdb["AVG_SCORE"]
wavg_from_ratings = imdb_ratings["weighted_average_vote"]

print("Are votes in AVG_SCORE the same as those in weighted_average_vote?", avg_from_imdb.equals(wavg_from_ratings))

Does VOTES contain the same data as total_votes?  True
Are votes in AVG_SCORE the same as those in weighted_average_vote? True


As expected, both `total_votes` and `weighted_average_vote` can be eliminated. For clarity, we'll also rename `AVG_SCORE` in the first dataset to `WAVG_SCORE`, to remind ourselves that it is some kind of weighted average. (According to IMDB, this column is actually calculated according to some [secret sauce](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV#) to ensure vote reliability.)

In [51]:
imdb_ratings.drop(columns = ["total_votes", "weighted_average_vote"], inplace = True)
imdb.rename(columns = {"AVG_SCORE": "WAVG_SCORE"}, inplace = True)

The `mean_vote` column, on the other hand, is not the same as the `WAVG_SCORE` column:

In [52]:
mean_from_ratings = imdb_ratings["mean_vote"]
print("Are votes in WAVG_SCORE the same as those in mean_vote?", mean_from_ratings.equals(imdb["WAVG_SCORE"]))

Are votes in WAVG_SCORE the same as those in mean_vote? False


`mean_vote` is probably just a simple average and therefore (supposedly) less reliable or interesting than IMDB's secret sauce average, but before we drop it, let's make sure that it is indeed just a simple average.

In [53]:
def calculate_mean(breakdown):    
    """Calculates a simple arithmetic mean of a movie's score from its score breakdown.
    The score breakdown contains the number of votes each of the 10 possible scores has received.
    
    `breakdown` is a series representing a movie's score breakdown, indexed from `votes_10` to `votes_1`
    """    
    total_votes = breakdown.sum()
    scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    
    # Multiply `votes_10` by 10, `votes_9` by 9, etc. Then sum them together and divide by total votes.
    mean_score = (breakdown * scores).sum() / total_votes
    mean_score = mean_score.round(1)
        
    return mean_score

# Extract a sub dataframe containing only how many votes each score has received for each movie.
score_breakdown = imdb_ratings.loc[: , "votes_10" : "votes_1"]

# Calculate the mean score of each movie.
calculated_mean_score = score_breakdown.apply(calculate_mean, axis = 1)

# See if `calculated_mean_score` and the `mean_vote` column are different.
mismatches = calculated_mean_score != imdb_ratings["mean_vote"]
mismatches.value_counts()

False    85805
True        50
dtype: int64

In the vast majority of cases, the calculated mean score is the same as the corresponding value in the `mean_vote` column. This alone is strong evidence that `mean_vote` is just a simple average, but out of curiosity, let's check how different those 50 mismatches are.

In [54]:
mean_vote_mismatch = imdb_ratings.loc[mismatches, "mean_vote"] # The 50 `mean_vote` values that differ from the calculated ones.
calculated_score_mismatch = calculated_mean_score[mismatches]  # The 50 calculated mean scores that differ from the 50 above.

mismatch_percent = ((calculated_score_mismatch - mean_vote_mismatch) / mean_vote_mismatch) * 100
print("Min error: {}\nMax error: {}\nMean error: {}".format(mismatch_percent.min(), mismatch_percent.max(), mismatch_percent.mean()))

Min error: -2.85714285714286
Max error: 2.85714285714286
Mean error: 0.26802026681855945


On average, the 50 mismatching scores were about 0.27% larger than the original values in `mean_vote`, and none was off by more than about 2.9% in either direction. Differences this small are most likely due to rounding at the first decimal place, so it's safe to say that `mean_vote` is indeed a simple arithmetic mean, and so we can drop it.

In [55]:
imdb_ratings.drop(columns = ["mean_vote"], inplace = True)
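
For reference, the simple mean computed row by row in `calculate_mean` could also be obtained with column-wise arithmetic. This is only a sketch on made-up numbers: `toy_breakdown` is a hypothetical two-movie stand-in for `score_breakdown`.

```python
import pandas as pd

# Column names and their scores, in the same order: votes_10 ... votes_1.
score_columns = ["votes_{}".format(s) for s in range(10, 0, -1)]
scores = list(range(10, 0, -1))

# Toy breakdown (made-up numbers): all counts start at zero.
toy_breakdown = pd.DataFrame(0, index=["movie_a", "movie_b"], columns=score_columns)
toy_breakdown.loc["movie_a", ["votes_10", "votes_1"]] = [2, 2]  # two 10s and two 1s
toy_breakdown.loc["movie_b", ["votes_8", "votes_6"]] = [1, 3]   # one 8 and three 6s

# Multiply each column by its score, sum per movie, divide by the vote count.
weighted_sum = toy_breakdown.mul(scores, axis="columns").sum(axis="columns")
toy_means = (weighted_sum / toy_breakdown.sum(axis="columns")).round(1)
print(toy_means)
```

On a dataset of this size, the row-wise `apply` is fast enough; the vectorized form mostly helps readability once the multiplication by `scores` is clear.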

As a matter of fact, in the main project we won't be needing the `median_vote` column either, nor any of the `votes_10`, `votes_9`, etc., columns (which contain the number of voters that gave a specific score to a movie). We also won't need any of the columns dealing with top voters, nor with voter location, so we can drop them too. In the main project, we won't be needing any of the `allgenders` columns either, but for the moment, we'll keep the `allgenders` votes columns and drop only the `allgenders` average-score columns.

In [56]:
cols_to_drop = (["median_vote"] + 
                ["votes_{}".format(i) for i in range(1, 11)] + 
                ["top1000_voters_rating", "top1000_voters_votes", "us_voters_rating", "non_us_voters_rating", "us_voters_votes", "non_us_voters_votes"] +
                ["allgenders_{}age_avg_vote".format(i) for i in [0, 18, 30, 45]])
imdb_ratings.drop(columns = cols_to_drop, inplace = True)

Now we can rename columns with the same style used in the previous file. Note that, as far as we know, the only secret-sauce weighted average is the final score that IMDB gives to each movie, stored in the `WAVG_SCORE` column. All other averages, such as the average score of all male or female voters, for example, are likely just normal averages, so their new names will contain the abbreviation `AVG`, not `WAVG`.

In [57]:
def update_name(old_name):
    """Updates a column name according to predefined criteria.
    `old_name` is a string containing the name to update.
    """    
    replacements = {"_allages": "",
                    "allgenders": "MF",
                    "females": "F",
                    "males": "M",                                        
                    "18age": "1829",
                    "30age": "3044",
                    "45age": "45PLUS",
                    "0age": "017",                    
                    "avg_vote": "AVG_SCORE"}    
    
    new_name = old_name
    for r in replacements:
        new_name = new_name.replace(r, replacements[r])        
    return new_name.upper()


old_columns_1_to_end = imdb_ratings.columns[1:]

# Update all old column names.
for old_name in old_columns_1_to_end:
    new_column_names[old_name] = update_name(old_name)

imdb_ratings.rename(columns = new_column_names, inplace = True)
imdb_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  85855 non-null  object 
 1   MF_017_VOTES        33359 non-null  float64
 2   MF_1829_VOTES       85149 non-null  float64
 3   MF_3044_VOTES       85845 non-null  float64
 4   MF_45PLUS_VOTES     85775 non-null  float64
 5   M_AVG_SCORE         85854 non-null  float64
 6   M_VOTES             85854 non-null  float64
 7   M_017_AVG_SCORE     27411 non-null  float64
 8   M_017_VOTES         27411 non-null  float64
 9   M_1829_AVG_SCORE    84390 non-null  float64
 10  M_1829_VOTES        84390 non-null  float64
 11  M_3044_AVG_SCORE    85843 non-null  float64
 12  M_3044_VOTES        85843 non-null  float64
 13  M_45PLUS_AVG_SCORE  85754 non-null  float64
 14  M_45PLUS_VOTES      85754 non-null  float64
 15  F_AVG_SCORE         85774 non-null  float64
 16  F_VO
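
One detail of `update_name` worth spelling out is that the result depends on the order of the replacement rules: `males` is a substring of `females`, and `0age` is a substring of `30age`. Dictionaries keep insertion order in Python 3.7+, so the order written in `replacements` is the order applied. A minimal sketch of what could go wrong otherwise; `apply_rules` and `careless_order` are hypothetical, and the sample name is assumed to follow the dataset's naming pattern:

```python
def apply_rules(name, rules):
    """Apply (old, new) substring replacements in the given order, then upper-case."""
    for old, new in rules:
        name = name.replace(old, new)
    return name.upper()

# Same rules, in the same order, as the `replacements` dictionary above.
original_order = [("_allages", ""), ("allgenders", "MF"), ("females", "F"), ("males", "M"),
                  ("18age", "1829"), ("30age", "3044"), ("45age", "45PLUS"), ("0age", "017"),
                  ("avg_vote", "AVG_SCORE")]

# Hypothetical reordering: "males" and "0age" are tried too early.
careless_order = [("males", "M"), ("0age", "017")] + original_order

sample = "females_30age_avg_vote"  # assumed example of an original column name
print(apply_rules(sample, original_order))
print(apply_rules(sample, careless_order))
```

The first call gives `F_3044_AVG_SCORE`, while the second gives `FEM_3017_AVG_SCORE`, because the shorter patterns consumed part of the longer ones.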

#### Handling nulls

There are quite a few columns with nulls to handle.

In [58]:
cols_with_nulls = imdb_ratings.columns[imdb_ratings.isna().any()]
imdb_ratings[cols_with_nulls].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MF_017_VOTES        33359 non-null  float64
 1   MF_1829_VOTES       85149 non-null  float64
 2   MF_3044_VOTES       85845 non-null  float64
 3   MF_45PLUS_VOTES     85775 non-null  float64
 4   M_AVG_SCORE         85854 non-null  float64
 5   M_VOTES             85854 non-null  float64
 6   M_017_AVG_SCORE     27411 non-null  float64
 7   M_017_VOTES         27411 non-null  float64
 8   M_1829_AVG_SCORE    84390 non-null  float64
 9   M_1829_VOTES        84390 non-null  float64
 10  M_3044_AVG_SCORE    85843 non-null  float64
 11  M_3044_VOTES        85843 non-null  float64
 12  M_45PLUS_AVG_SCORE  85754 non-null  float64
 13  M_45PLUS_VOTES      85754 non-null  float64
 14  F_AVG_SCORE         85774 non-null  float64
 15  F_VOTES             85774 non-null  float64
 16  F_01

The first four columns will eventually be dropped, so we don't really care about their nulls. As for the others, each pair of columns has the same number of non-nulls (and, hence, of nulls), so it seems reasonable that paired columns have nulls in exactly the same places. So for example, `F_3044_AVG_SCORE` should have nulls exactly where `F_3044_VOTES` does. (This makes sense, because if we have no votes from females between the ages of 30 and 44, we certainly don't have an average score for them either.) Each column pairs up with the one immediately after it, so checking whether this guess is correct is fairly easy.

In [59]:
exit_message = "No unpaired nulls found across any pairs of columns."

for i in range(4, len(cols_with_nulls), 2):    
    col1 = cols_with_nulls[i]
    col2 = cols_with_nulls[i + 1]
    # Find dataset entries where EITHER col1 OR col2 is null, but not both.
    unpaired_nulls = (imdb_ratings[col1].isna() ^ imdb_ratings[col2].isna()).sum()    
    if unpaired_nulls > 0:
        exit_message = ""
        print("Column {} and {} have {} unpaired nulls.".format(col1, col2, unpaired_nulls))
print(exit_message)    

No unpaired nulls found across any pairs of columns.
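
To see what the XOR (`^`) in the check above picks up, here is a toy example with made-up values; the column names `X_AVG_SCORE` and `X_VOTES` are hypothetical. A row is flagged only when exactly one of the two columns is null.

```python
import numpy as np
import pandas as pd

# Made-up data: rows 0 and 1 are "paired" (both present or both null),
# rows 2 and 3 each have a null in only one of the two columns.
toy = pd.DataFrame({
    "X_AVG_SCORE": [7.1, np.nan, 6.0, np.nan],
    "X_VOTES": [120, np.nan, np.nan, 15],
})

# True where EITHER column is null, but not both.
unpaired = toy["X_AVG_SCORE"].isna() ^ toy["X_VOTES"].isna()
print(unpaired.tolist())
print("Unpaired nulls:", unpaired.sum())
```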


Any null values in these columns presumably mean that the specific category represented by the column did not cast any vote for some movies. For example, it may well be that some movies weren't scored by any males between the ages of 0 and 17. However, the votes from males and females combined should still never exceed the total of all votes for any given movie, for example. If these sums are *smaller* than the relevant total, it's no big deal: some of the votes came from users whose sex or age was unknown. If these sums are *bigger* than the total, something quite literally doesn't add up.

In [60]:
def determine_mismatch(difference):
    """Takes in a number and determines its sign or if it's zero. The number represents the difference between the sum 
       of different Series and their expected total. Used to determine if the sum adds up correctly or if it's smaller or larger than it should.
       `difference` is the number whose sign must be checked."""
    if difference < 0: return "Smaller"
    if difference > 0: return "Larger"
    return "Equal"

# Age&sex brackets: Is M plus F for each age bracket equal to the total of the age bracket?
for bracket in ["017", "1829", "3044", "45PLUS"]:
    m_plus_f = imdb_ratings["M_{}_VOTES".format(bracket)] + imdb_ratings["F_{}_VOTES".format(bracket)]
    bracket_mismatch_type = (m_plus_f - imdb_ratings["MF_{}_VOTES".format(bracket)]).apply(determine_mismatch)
bracket_mismatch_type.value_counts()

Smaller    62743
Equal      23112
dtype: int64

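In the output above there's no case where the sum exceeds the total. Two caveats apply, though. As written, the loop overwrites `bracket_mismatch_type` on every pass, so the counts shown are only those of the last bracket, `45PLUS`. Also, comparisons with a null are always false, so `determine_mismatch` labels any row with a null difference as `Equal`. With that in mind, `Smaller` only means that, for some movies, the sex of some voters in that bracket wasn't known.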
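
Since a sum involving a null is itself null, and a null is neither smaller nor larger than zero, a helper like `determine_mismatch` files such rows under `Equal`. A possible variant that reports them separately is sketched below; `classify_difference` is a hypothetical name, and the vote counts are made up.

```python
import numpy as np
import pandas as pd

def classify_difference(difference):
    """Like determine_mismatch, but reports null differences separately."""
    if pd.isna(difference): return "Unknown"
    if difference < 0: return "Smaller"
    if difference > 0: return "Larger"
    return "Equal"

# Made-up vote counts for four movies; the third has no known male votes.
m_votes = pd.Series([10, 5, np.nan, 7])
f_votes = pd.Series([3, 5, 2, 1])
all_votes = pd.Series([13, 12, 4, 7])

labels = (m_votes + f_votes - all_votes).apply(classify_difference)
print(labels.tolist())
```

Whether a null count should instead be treated as zero votes (for example with `fillna(0)`) is a judgment call that depends on what the nulls mean in the source data.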

In [61]:
# Sex: does the sum of age brackets, for each sex, add up to the total votes for that sex?
for sex in ["M", "F"]:
    brackets_sum = 0
    for bracket in ["017", "1829", "3044", "45PLUS"]:
        brackets_sum += imdb_ratings["{}_{}_VOTES".format(sex, bracket)]
    sex_bracket_mismatch_type = (brackets_sum - imdb_ratings["{}_VOTES".format(sex)]).apply(determine_mismatch)
    print(sex_bracket_mismatch_type.value_counts())

Equal      58784
Smaller    27071
dtype: int64
Equal      65766
Smaller    20089
dtype: int64


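Same story as above; for both sexes, in some cases the age bracket of voters wasn't known, and that's not a problem. Keep in mind, though, that a movie with a null in at least one age bracket yields a null sum and is therefore labelled `Equal`: for males, `M_017_VOTES` alone is null for 58,444 movies (85,855 minus 27,411), so most of the 58,784 `Equal` rows are of this kind. Similarly, below we see that in the vast majority of cases, the sex of some voters wasn't known.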

In [62]:
# Sex: does the sum of M and F add up to VOTES?
m_plus_f = imdb_ratings["M_VOTES"] + imdb_ratings["F_VOTES"] 
sex_mismatch_type = (m_plus_f - imdb["VOTES"]).apply(determine_mismatch)
print(sex_mismatch_type.value_counts())

Smaller    85768
Equal         87
dtype: int64


Ultimately, what this all means is that these null-containing columns are fine; null values in columns representing averages should stay null, whereas we can drop any columns that represent the number of votes, because we won't be needing them anymore.

In [63]:
cols_to_drop = [col for col in cols_with_nulls if col.endswith("_VOTES")]
imdb_ratings.drop(columns = cols_to_drop, inplace = True)

In [64]:
imdb_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  85855 non-null  object 
 1   M_AVG_SCORE         85854 non-null  float64
 2   M_017_AVG_SCORE     27411 non-null  float64
 3   M_1829_AVG_SCORE    84390 non-null  float64
 4   M_3044_AVG_SCORE    85843 non-null  float64
 5   M_45PLUS_AVG_SCORE  85754 non-null  float64
 6   F_AVG_SCORE         85774 non-null  float64
 7   F_017_AVG_SCORE     22117 non-null  float64
 8   F_1829_AVG_SCORE    79334 non-null  float64
 9   F_3044_AVG_SCORE    84911 non-null  float64
 10  F_45PLUS_AVG_SCORE  83057 non-null  float64
dtypes: float64(10), object(1)
memory usage: 7.2+ MB


### Merging the two datasets

All that is left to do at this point is to merge `imdb` with `imdb_ratings`. 

In [65]:
imdb_merged = pd.merge(imdb, imdb_ratings, how = "inner", on = "ID")
imdb_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 85855 entries, 0 to 85854
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  85855 non-null  object 
 1   TITLE               85855 non-null  object 
 2   RELEASE_YEAR        85855 non-null  int32  
 3   GENRE               85855 non-null  object 
 4   LENGTH_MIN          85855 non-null  int64  
 5   COUNTRY             85855 non-null  object 
 6   WAVG_SCORE          85855 non-null  float64
 7   VOTES               85855 non-null  int64  
 8   BUDGET_USD          13397 non-null  float64
 9   GLOBAL_GROSS_USD    30955 non-null  float64
 10  M_AVG_SCORE         85854 non-null  float64
 11  M_017_AVG_SCORE     27411 non-null  float64
 12  M_1829_AVG_SCORE    84390 non-null  float64
 13  M_3044_AVG_SCORE    85843 non-null  float64
 14  M_45PLUS_AVG_SCORE  85754 non-null  float64
 15  F_AVG_SCORE         85774 non-null  float64
 16  F_01
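
Since the merge relies on `ID` being a unique key on both sides, pandas can be asked to enforce that and to report where each row came from. A minimal sketch on made-up data (the IDs, titles and scores below are hypothetical), using the `validate` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Made-up miniature versions of the two datasets.
left = pd.DataFrame({"ID": ["tt001", "tt002", "tt003"], "TITLE": ["A", "B", "C"]})
right = pd.DataFrame({"ID": ["tt001", "tt002", "tt003"], "M_AVG_SCORE": [6.1, 7.4, None]})

# validate raises MergeError if ID is duplicated on either side;
# indicator adds a `_merge` column saying which side each row came from.
merged = pd.merge(left, right, how="outer", on="ID",
                  validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())
```

If every row is marked `both`, an inner merge loses nothing, which matches the 85,855 entries reported above.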

The columns could use some rearrangement.

In [66]:
rearranged_columns = (imdb_merged.columns[0:6].to_list() + 
                      ["BUDGET_USD", "GLOBAL_GROSS_USD", "WAVG_SCORE", "VOTES"]                     
                      + imdb_merged.columns[10:].to_list())
imdb_final = imdb_merged.reindex(columns = rearranged_columns)
imdb_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 85855 entries, 0 to 85854
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  85855 non-null  object 
 1   TITLE               85855 non-null  object 
 2   RELEASE_YEAR        85855 non-null  int32  
 3   GENRE               85855 non-null  object 
 4   LENGTH_MIN          85855 non-null  int64  
 5   COUNTRY             85855 non-null  object 
 6   BUDGET_USD          13397 non-null  float64
 7   GLOBAL_GROSS_USD    30955 non-null  float64
 8   WAVG_SCORE          85855 non-null  float64
 9   VOTES               85855 non-null  int64  
 10  M_AVG_SCORE         85854 non-null  float64
 11  M_017_AVG_SCORE     27411 non-null  float64
 12  M_1829_AVG_SCORE    84390 non-null  float64
 13  M_3044_AVG_SCORE    85843 non-null  float64
 14  M_45PLUS_AVG_SCORE  85754 non-null  float64
 15  F_AVG_SCORE         85774 non-null  float64
 16  F_01

The dataset can now be exported as its own CSV file, so that analyses can be done without having to rerun all the cleaning first.

In [67]:
imdb_final.to_csv("imdb_final.csv", index = False)

Just a little test, to make sure the saved dataset reads just fine.

In [68]:
imdb_test = pd.read_csv("imdb_final.csv", header = 0)
imdb_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  85855 non-null  object 
 1   TITLE               85855 non-null  object 
 2   RELEASE_YEAR        85855 non-null  int64  
 3   GENRE               85855 non-null  object 
 4   LENGTH_MIN          85855 non-null  int64  
 5   COUNTRY             85855 non-null  object 
 6   BUDGET_USD          13397 non-null  float64
 7   GLOBAL_GROSS_USD    30955 non-null  float64
 8   WAVG_SCORE          85855 non-null  float64
 9   VOTES               85855 non-null  int64  
 10  M_AVG_SCORE         85854 non-null  float64
 11  M_017_AVG_SCORE     27411 non-null  float64
 12  M_1829_AVG_SCORE    84390 non-null  float64
 13  M_3044_AVG_SCORE    85843 non-null  float64
 14  M_45PLUS_AVG_SCORE  85754 non-null  float64
 15  F_AVG_SCORE         85774 non-null  float64
 16  F_01
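
Note that `RELEASE_YEAR` comes back as `int64` rather than `int32`: a CSV file stores no dtype information, so `read_csv` infers types again. If a stricter round-trip test were wanted, something along these lines could work. It's a sketch on a made-up miniature frame, written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Made-up miniature frame with an int32 column and a null, like imdb_final.
toy = pd.DataFrame({
    "ID": ["tt001", "tt002"],
    "RELEASE_YEAR": pd.Series([1939, 2009], dtype="int32"),
    "BUDGET_USD": [None, 1_000_000.0],
})

# Write to an in-memory buffer and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Values must match; dtypes are allowed to differ (int32 vs int64).
pd.testing.assert_frame_equal(toy, restored, check_dtype=False)
print(toy["RELEASE_YEAR"].dtype, "->", restored["RELEASE_YEAR"].dtype)
```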