# A) Data Cleaning
<p>To start, I looked at the general properties of the data.</p>

In [1]:
import pandas as pd
df = pd.read_excel("data/movies.xlsx")
df.head()

Unnamed: 0,Film,Year,Script Type,Rotten Tomatoes critics,Metacritic critics,Average critics,Rotten Tomatoes Audience,Metacritic Audience,Rotten Tomatoes vs Metacritic deviance,Average audience,...,of Gross earned abroad,Budget ($million),Budget recovered,Budget recovered opening weekend,Distributor,IMDb Rating,IMDB vs RT disparity,Release Date (US),Oscar Winners,Oscar Detail
0,300,2007,adaptation,60,51,56,89.0,71,18,80,...,53.82%,65,701.64%,109.05%,,,,"Mar 9, 2007",,
1,3:10 to Yuma,2007,remake,88,76,82,86.0,73,13,80,...,23.18%,50,139.56%,28.07%,,,,"Sep 7, 2007",,
2,30 Days of Night,2007,adaptation,50,53,52,56.0,65,-9,61,...,47.31%,32,234.67%,49.85%,,,,"Oct 19, 2007",,
3,Across the Universe,2007,original screenplay,54,56,55,82.0,73,9,78,...,17.11%,45,65.26%,8.50%,,,,"Oct 12, 2007",,
4,Alien vs. Predator - Requiem,2007,sequel,14,29,22,31.0,45,-14,38,...,67.57%,40,322.21%,25.15%,,,,"Dec 25, 2007",,


In [7]:
df.columns

Index(['Film', 'Year', 'Script Type', 'Rotten Tomatoes  critics',
       'Metacritic  critics', 'Average critics ', 'Rotten Tomatoes Audience ',
       'Metacritic Audience ', 'Rotten Tomatoes vs Metacritic  deviance',
       'Average audience ', 'Audience vs Critics deviance ', 'Primary Genre',
       'Genre', 'Opening Weekend', 'Opening weekend ($million)',
       'Domestic Gross', 'Domestic gross ($million)',
       'Foreign Gross ($million)', 'Foreign Gross', 'Worldwide Gross',
       'Worldwide Gross ($million)', ' of Gross earned abroad',
       'Budget ($million)', ' Budget recovered',
       ' Budget recovered opening weekend', 'Distributor', 'IMDb Rating',
       'IMDB vs RT disparity', 'Release Date (US)', 'Oscar Winners',
       'Oscar Detail'],
      dtype='object')

### General Observations About Columns
<ol>
<li>
Some columns duplicate others in a different format. For example, <code>Domestic Gross</code> and <code>Domestic gross ($million)</code> hold the same value in different units of measurement.
</li>
<li>
Some columns can be calculated from others (<code>Average audience</code>, <code>Budget recovered</code>, etc.). To avoid carrying over mistakes from the dataset, we drop these columns and recalculate them.
</li>
</ol>
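
One way to confirm that a column is calculable is to recompute it and compare the result with the stored values. The sketch below does this for `Average critics ` using only the five rows shown in the `df.head()` output above. The half-up rounding is an assumption inferred from those rows (55.5 is stored as 56), not something the dataset documents.

```python
import numpy as np
import pandas as pd

# First five rows of the critic columns, copied from the df.head() output.
# Column names are spelled exactly as df.columns prints them.
sample = pd.DataFrame(
    {
        "Rotten Tomatoes  critics": [60, 88, 50, 54, 14],
        "Metacritic  critics": [51, 76, 53, 56, 29],
        "Average critics ": [56, 82, 52, 55, 22],
    }
)

# Recompute the average. The stored values appear to round halves up,
# so floor(x + 0.5) is used instead of round() (assumption).
mean = sample[["Rotten Tomatoes  critics", "Metacritic  critics"]].mean(axis=1)
recomputed = np.floor(mean + 0.5).astype(int)

# True for every row whose stored value can be reproduced from the others.
matches = recomputed.eq(sample["Average critics "])
print(matches.all())
```

If the same check fails on the full dataset, the mismatching rows are worth inspecting before the column is dropped.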

### Columns Dropped for the Reasons Above
<table>
<tr>
    <th>Column Dropped</th>
    <th>Reason</th>
</tr>
<tr>
    <td>DOMESTIC GROSS (\$MILLION)</td>
    <td>Duplicate</td>
</tr>
<tr><td>FOREIGN GROSS (\$MILLION)</td><td>Duplicate</td></tr>
<tr><td>OPENING WEEKEND (\$MILLION)</td><td>Duplicate</td></tr>
<tr><td>WORLDWIDE GROSS (\$MILLION)</td><td>Duplicate</td></tr>
<tr><td>OF GROSS EARNED ABROAD</td><td>Calculable</td></tr>
<tr><td>ROTTEN TOMATOES VS METACRITIC  DEVIANCE</td><td>Calculable</td></tr>
<tr><td>AVERAGE AUDIENCE</td><td>Calculable</td></tr>
<tr><td>AVERAGE CRITICS</td><td>Calculable</td></tr>
<tr><td>BUDGET RECOVERED OPENING WEEKEND</td><td>Calculable</td></tr>
<tr><td>BUDGET RECOVERED</td><td>Calculable</td></tr>
<tr><td>AUDIENCE VS CRITICS DEVIANCE</td><td>Calculable</td></tr>

</table>
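
A minimal sketch of how the columns in the table could be dropped. The names are copied from the `df.columns` output, including their stray spaces. `COLUMNS_TO_DROP`, `drop_redundant` and the one-row `demo` frame are hypothetical names used only for illustration; the real notebook would pass `df` instead.

```python
import pandas as pd

# Raw names exactly as printed by df.columns, including stray spaces.
COLUMNS_TO_DROP = [
    "Domestic gross ($million)",
    "Foreign Gross ($million)",
    "Opening weekend ($million)",
    "Worldwide Gross ($million)",
    " of Gross earned abroad",
    "Rotten Tomatoes vs Metacritic  deviance",
    "Average audience ",
    "Average critics ",
    " Budget recovered opening weekend",
    " Budget recovered",
    "Audience vs Critics deviance ",
]


def drop_redundant(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without the duplicate/calculable columns."""
    # errors="ignore" skips names that are absent instead of raising KeyError.
    return frame.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# One-row stand-in built from the first row of df.head(), so the sketch
# runs without the spreadsheet.
demo = pd.DataFrame(
    {
        "Film": ["300"],
        "Budget ($million)": [65],
        " Budget recovered": ["701.64%"],
        "Average critics ": [56],
    }
)
remaining = drop_redundant(demo).columns.tolist()
print(remaining)
```

Note that `errors="ignore"` also hides misspelled names, so it may be safer to leave it out when running against the full dataset.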

### Observations About Column Names
- Naming is inconsistent across columns (leading/trailing spaces, two adjacent spaces).
- Some column names are not descriptive enough.
- Some names contain characters that are awkward in code; it would be tedious to write `Worldwide Gross ($million)` multiple times.
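
The observations above can also be checked programmatically rather than by eye. The sketch below flags names with edge whitespace, repeated spaces, or non-alphanumeric symbols. `naming_issues` is a hypothetical helper, and the sample list is a handful of names taken from the `df.columns` output.

```python
import pandas as pd


def naming_issues(columns) -> pd.DataFrame:
    """Flag column names with edge whitespace, repeated spaces, or symbols."""
    names = pd.Series(list(columns), dtype="object")
    return pd.DataFrame(
        {
            "column": names,
            # Name changes when stripped -> leading or trailing whitespace.
            "edge_space": names.str.strip() != names,
            # Two or more whitespace characters in a row.
            "double_space": names.str.contains(r"\s{2,}", regex=True),
            # Anything other than letters, digits, underscore, whitespace.
            "symbols": names.str.contains(r"[^\w\s]", regex=True),
        }
    )


sample_columns = [
    "Film",
    "Metacritic  critics",
    "Average critics ",
    " Budget recovered",
    "Worldwide Gross ($million)",
]
report = naming_issues(sample_columns)

# Show only the names that have at least one problem.
flags = ["edge_space", "double_space", "symbols"]
print(report[report[flags].any(axis=1)])
```

In the notebook, the same helper could be called as `naming_issues(df.columns)`.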

## Renaming of Columns

<p>Renaming each column follows a ruleset that I came up with for consistency, readability and descriptiveness:</p>

- All letters are upper case
- All spaces are replaced by an underscore
- If the column name contains a brand name, it is abbreviated (e.g. Metacritic => MC, Rotten Tomatoes => RT)
- Percentage symbols are added where appropriate
- (Rarely used) If a column name is undesirable or non-descriptive, a more appropriate name is chosen
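
The mechanical rules (upper case, underscores, brand abbreviations) could be sketched as a small helper. `normalize` and `BRANDS` are hypothetical names. The helper also trims and collapses whitespace first, which the ruleset implies but does not state. The percentage prefix and the hand-picked names are judgment calls and are not covered here.

```python
import re

# Brand abbreviations taken from the rule list above.
BRANDS = {"ROTTEN TOMATOES": "RT", "METACRITIC": "MC"}


def normalize(name: str) -> str:
    """Apply the mechanical rules: trim, upper-case, abbreviate, underscore."""
    # Trim the ends and collapse runs of whitespace to a single space.
    cleaned = re.sub(r"\s+", " ", name.strip()).upper()
    # Replace brand names with their abbreviations.
    for brand, short in BRANDS.items():
        cleaned = cleaned.replace(brand, short)
    # Finally, replace the remaining spaces with underscores.
    return cleaned.replace(" ", "_")


print(normalize("Rotten Tomatoes  critics"))  # RT_CRITICS
print(normalize("Metacritic Audience "))      # MC_AUDIENCE
print(normalize("Script Type"))               # SCRIPT_TYPE
```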

###### Note: Some columns are missing from the table; they will be addressed later
<table>
<tr>
<th>Column Before</th>
<th>Column After</th>
</tr>
<tr><td>FILM</td><td>TITLE</td></tr>
<tr><td>YEAR</td><td>RELEASE_YEAR</td></tr>
<tr><td>SCRIPT TYPE</td><td>SCRIPT_TYPE</td></tr>
<tr><td>ROTTEN TOMATOES  CRITICS</td><td>RT_CRITICS</td></tr>
<tr><td>METACRITIC  CRITICS</td><td>MC_CRITICS</td></tr>
<tr><td>AVERAGE CRITICS</td><td>AVERAGE_CRITICS</td></tr>
<tr><td>ROTTEN TOMATOES AUDIENCE</td><td>RT_AUDIENCE</td></tr>
<tr><td>METACRITIC AUDIENCE</td><td>MC_AUDIENCE</td></tr>
<tr><td>ROTTEN TOMATOES VS METACRITIC  DEVIANCE</td><td>RT_MC_AUDIENCE_DIFFERENCE</td></tr>
<tr><td>AVERAGE AUDIENCE</td><td>AVERAGE_AUDIENCE</td></tr>
<tr><td>AUDIENCE VS CRITICS DEVIANCE</td><td>CRITICS_AUDIENCE_DIFFERENCE</td></tr>
<tr><td>PRIMARY GENRE</td><td>PRIMARY_GENRE</td></tr>
<tr><td>GENRE</td><td>GENRE</td></tr>
<tr><td>OPENING WEEKEND</td><td>OPENING_WEEKEND</td></tr>
<tr><td>DOMESTIC GROSS</td><td>DOMESTIC_GROSS</td></tr>
<tr><td>FOREIGN GROSS</td><td>FOREIGN_GROSS</td></tr>
<tr><td>WORLDWIDE GROSS</td><td>WORLDWIDE_GROSS</td></tr>
<tr><td>OF GROSS EARNED ABROAD</td><td>%OF_GROSS_EARNED_ABROAD</td></tr>
<tr><td>BUDGET (\$MILLION)</td><td>BUDGET</td></tr>
<tr><td>BUDGET RECOVERED</td><td>%BUDGET_RECOVERED</td></tr>
<tr><td>BUDGET RECOVERED OPENING WEEKEND</td><td>%BUDGET_RECOVERED_OPENING_WEEKEND</td></tr>
<tr><td>IMDB RATING</td><td>IMDB_RATING</td></tr>
<tr><td>IMDB VS RT DISPARITY</td><td>IMDB_RT_DIFFERENCE</td></tr>
<tr><td>OSCAR WINNERS</td><td>WON_OSCAR</td></tr>
<tr><td>OSCAR DETAIL</td><td>OSCAR_DETAILS</td></tr>
</table>
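
Names that do not follow from the mechanical rules (for example `Film` becoming `TITLE`) need an explicit mapping. A minimal sketch with `DataFrame.rename`, using a few rows of the table above: `RENAMES` and the one-row `demo` frame are hypothetical, and the keys are the raw names printed by `df.columns`.

```python
import pandas as pd

# A few rows of the table, keyed by the raw names printed by df.columns.
RENAMES = {
    "Film": "TITLE",
    "Year": "RELEASE_YEAR",
    "Budget ($million)": "BUDGET",
    "Oscar Winners": "WON_OSCAR",
    "Oscar Detail": "OSCAR_DETAILS",
}

# One-row stand-in built from the first row of df.head().
demo = pd.DataFrame({"Film": ["300"], "Year": [2007], "Budget ($million)": [65]})

# rename() leaves columns that are not in the mapping untouched and, by
# default, ignores mapping keys that are not present in the frame.
renamed = demo.rename(columns=RENAMES)
print(renamed.columns.tolist())
```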