# Part I: Loading, exploring and cleaning data

![](img/cleaning.jpeg)


The data has been obtained from the Internet Movie Database (IMDb):
- https://www.imdb.com/search/title?title_type=feature&release_date=1970-01-01,&countries=in&languages=hi&page=1

In this section we'll learn:
- How to load data from different sources
- How to combine data from different sources
- How to explore the content of the loaded data: the raw data, statistics and aggregates
- How to clean the data: removing wrong values, formatting it in a useful representation, dealing with missing values...

```python
import os

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

MOVIES_LIST_FNAME = os.path.join('data', 'movies_from_list.csv.gz')
MOVIES_DETAIL_FNAME = os.path.join('data', 'movies_from_list.jsonl.bz2')
```

### Load data

pandas supports many formats for loading data. You can see most of them in:
- https://pandas.pydata.org/pandas-docs/stable/api.html#input-output
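
As a rough illustration of two of those readers, the sketch below parses CSV and JSON Lines content from in-memory text instead of the workshop files. The column names (`imdb_id`, `title`, `year`) and the values are made up for the example and are not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical content, standing in for a CSV file and a JSON Lines file
csv_text = "imdb_id,title\ntt0000001,Example A\ntt0000002,Example B\n"
jsonl_text = (
    '{"imdb_id": "tt0000001", "year": 2001}\n'
    '{"imdb_id": "tt0000002", "year": 2014}\n'
)

# read_csv parses comma-separated text into a DataFrame
movies_list = pd.read_csv(io.StringIO(csv_text))

# read_json with lines=True expects one JSON document per line
movies_detail = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(movies_list)
print(movies_detail)
```

When a file path is passed instead of a buffer, both functions infer the compression from extensions such as `.gz` or `.bz2` by default, so compressed files can usually be read without extra arguments.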

Also, data can be imported into pandas from Python objects (list, dict, tuple...) with the `DataFrame` and `Series` constructor.
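
A minimal sketch of the constructors, using a small hand-written `list` and `dict` (toy values, not the workshop data):

```python
import pandas as pd

# A Series built from a plain Python list
first_months = pd.Series(['January', 'February', 'March'])

# A DataFrame built from a dict that maps column names to lists of values
movies_example = pd.DataFrame({
    'title': ['PK', '3 idiots'],
    'year': [2014, 2009],
})

print(first_months)
print(movies_example)
```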

**EXERCISE:** Load the datasets defined in `MOVIES_LIST_FNAME` and `MOVIES_DETAIL_FNAME` into pandas `DataFrame` objects, and create a `Series` with the content of `MONTHS`.

### Exploring

There are many ways to explore the data in a pandas container. Try calling the following methods on them: `.head()`, `.tail()`, `.head().T`, `.describe()` and `.info()`.
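
To give an idea of what these methods return, here is a small sketch on a toy `DataFrame`. The column names and values are assumptions for the example only.

```python
import pandas as pd

# Toy data with hypothetical columns
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'year': [2001, 2009, 2014, 2012],
    'rating': [7.4, 8.4, 8.1, 6.9],
})

print(df.head(2))      # first two rows
print(df.tail(2))      # last two rows
print(df.head(2).T)    # transposed: one column per row, handy for wide tables
print(df.describe())   # summary statistics of the numeric columns
df.info()              # column types and non-null counts (prints directly)
```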

### Combining datasets

pandas provides different ways to combine datasets into one. For people familiar with SQL, the functionality is similar to `JOIN` and `UNION`.

Documentation can be found here:
- https://pandas.pydata.org/pandas-docs/stable/merging.html

In the case of adding columns of one `DataFrame` to another, we can use `DataFrame.join` (equivalent to SQL `JOIN`). Documentation can be found here:
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html

**EXERCISE:** In this case we want to join the two previously loaded `DataFrame` objects into one. To make sure we implement it correctly, we need to check the shape of both `DataFrame` objects with the attribute `.shape`, and also check the shape of the resulting `DataFrame` to make sure we are not duplicating or missing rows. When joining, the important parameters are `on` and `how`.
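
The general pattern can be sketched on two toy tables. The key column `imdb_id` and the other column names are assumptions for this example; the real files may use different names.

```python
import pandas as pd

# Two hypothetical tables that share a key column
movies_list = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2', 'tt3'],
    'title': ['A', 'B', 'C'],
})
movies_detail = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2', 'tt3'],
    'year': [2001, 2009, 2014],
})

print(movies_list.shape, movies_detail.shape)

# DataFrame.join matches the caller's `on` column against the index of the
# other DataFrame, so the key is moved to the index of `movies_detail` first
joined = movies_list.join(
    movies_detail.set_index('imdb_id'),
    on='imdb_id',
    how='left',
)

# Same number of rows as the left table: nothing duplicated, nothing lost
print(joined.shape)
```

With `how='left'` every row of the left table is kept. Other values such as `'inner'` or `'outer'` change which rows survive when the keys do not match exactly.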

### Selecting data

pandas has two main containers, `Series` and `DataFrame`. You can think of a `DataFrame` as the content of a spreadsheet, or a SQL table, and of a `Series` as one of its columns.

More formally, a `DataFrame` is a 2-dimensional structure, where each element is labelled both by its column and its row.

| Movie | Year | Genre |
| --- | --- | --- |
| **PK** | 2014 | Comedy |
| **Kabhi Khushi Kabhie Gham...** | 2001 | Drama |
| **3 idiots** | 2009 | Comedy |

As you can see in the example, every value can be defined by its position based on its labels. For example, the value at `PK`/`Genre` is `Comedy`. In pandas, this would be `movies.loc['PK', 'Genre']`.

Also, every value can be defined by its integer position. For example, in the second row and first column, the value is `2001`. In pandas this would be `movies.iloc[1, 0]` (remember that in Python, the first index is `0`, the second is `1`...)
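
The table above can be rebuilt as a small `DataFrame` to try both ways of addressing a value. This sketch assumes the movie title is used as the row index, as the bold first column of the table suggests.

```python
import pandas as pd

# The example table, with the movie title as the row label (index)
movies = pd.DataFrame(
    {
        'Year': [2014, 2001, 2009],
        'Genre': ['Comedy', 'Drama', 'Comedy'],
    },
    index=['PK', 'Kabhi Khushi Kabhie Gham...', '3 idiots'],
)

# Label-based access: row label first, then column label
by_label = movies.loc['PK', 'Genre']

# Position-based access: second row (1), first column (0)
by_position = movies.iloc[1, 0]

print(by_label, by_position)
```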

**Exercise:** Can you obtain the value in the column `storyline` and the row `tt2905838` of the movies `DataFrame`? Can you guess which movie it is? :)

To select a whole column from a `DataFrame` (which will be a `Series`) we can use `movies['rating']` (like getting a value from a `dict`). If the name of the column is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method, we can also use `movies.rating` (like for a class attribute or a `namedtuple`).

Then, we can perform operations with the resulting `Series`:
- Compute the maximum, the minimum, the median, or any other provided statistic:
   - `movies['rating'].min()`
   - `movies['rating'].max()`
   - `movies['rating'].median()`
- Double the rating of each movie (if we feel generous): `movies['rating'] * 2`
- Make comparisons: `movies['rating'] > .8` (this will return a `Series` with `True` and `False` values, depending on whether the condition is satisfied in each row)

The last point is quite important, as boolean `Series` can be used to filter the data. The syntax is `movies[condition]`, where `condition` is a `Series` of boolean values. Note that the syntax is the same as `movies[column_name]`; pandas checks the type and size of what is between the square brackets to decide whether to return a column or a filtered `DataFrame`.
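
The operations and the filtering described above can be sketched on a toy `DataFrame`. The titles, ratings and the threshold of `8` are made up for the example.

```python
import pandas as pd

# Toy data with hypothetical titles and ratings
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': [6.5, 8.1, 7.2, 9.0],
})

ratings = movies['rating']      # selecting one column returns a Series

lowest = ratings.min()
highest = ratings.max()
middle = ratings.median()
doubled = ratings * 2           # element-wise arithmetic

# A comparison returns a boolean Series, with one value per row
is_high = ratings > 8

# Using the boolean Series between square brackets keeps the True rows only
high_rated = movies[is_high]

print(lowest, highest, middle)
print(high_rated)
```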

**Exercise:** Return all the movies with a rating greater than `9.5` and a year later than `2015`.

### Extracting information

There is some information that needs to be transformed before we can use it. For example, the column with the number of votes. If you take a look, you can see how instead of having the number of votes as a number, the column contains a string including thousand separators (e.g. `1,000`). This is how it was presented in the original source.

But this format is not appropriate for numeric operations. For example, if we check the smallest (`.min()`) or the largest (`.max()`) number of votes in our data, we do not get the right results. This is because strings are compared character by character, so `'9'` is greater than `'10'` in the same way `'z'` is "greater" than `'ab'`.
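
The problem is easy to reproduce with a few hand-written values (toy data, not the real votes column):

```python
import pandas as pd

# Strings are compared character by character, not by numeric value
print('9' > '10')     # True, because '9' sorts after '1'
print('z' > 'ab')     # True, for the same reason

# A Series of vote counts stored as text
votes_as_text = pd.Series(['9', '10', '1,000'])

largest = votes_as_text.max()    # '9', although 1,000 is the largest number
smallest = votes_as_text.min()   # '1,000', because ',' sorts before digits

print(largest, smallest)
```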

**Exercises:**
1. Create a new column `rating_votes` with the number of votes converted to a number. The first step is to get a string representation that does not contain the commas. This can be done with the method `.str.replace()`, replacing the commas with empty strings. The second step is to convert the column type from Python objects (strings) to a float representation. This can be done with the method `.astype()`.

1. Get the smallest and largest values for the number of votes.

1. Check how many missing values we have in the data (pandas provides the methods `.isnull()` and `.notnull()`). Remember that in Python, `True == 1` and `False == 0`, so you can sum `True` and `False` values.

1. Plot a histogram of the data, to get an idea of how the number of votes is distributed. pandas provides a `.hist()` method. You can play with the parameter `bins` and see how the number of bins makes the plot change.
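
The conversion and the missing-value count can be sketched on a toy `Series`, under the assumption that missing votes are stored as `NaN`. The values are invented, and the plotting step is left out because it needs matplotlib.

```python
import numpy as np
import pandas as pd

# Toy vote counts as text, with thousand separators and one missing value
votes_text = pd.Series(['1,000', '25', np.nan, '12,345'])

# Remove the commas (a literal replacement, not a regular expression),
# then convert the strings to floats; missing values stay as NaN
votes = votes_text.str.replace(',', '', regex=False).astype(float)

smallest = votes.min()    # NaN values are skipped by default
largest = votes.max()

# isnull() returns booleans; summing them counts the True values
n_missing = votes.isnull().sum()

print(smallest, largest, n_missing)
```

A float type is used rather than an integer type because `NaN` is itself a float, so a plain integer column cannot hold missing values.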

### Save data as JSON

Finally, in order to persist the data that we have created (by transforming the original data), we can export it. `pandas` supports many formats, which are listed here:
- https://pandas.pydata.org/pandas-docs/stable/api.html#id12

The format to use depends on the case. Here are some examples:
- csv: When we want to open the data with a spreadsheet (discouraged in other cases, as it loses type information and is inefficient)
- json: When there are complex data structures we want to preserve (like `list`, `dict`...)
- hdf and parquet: When we want to save the data in an efficient way (consume less storage and load faster)

**Exercise:** Save the data as JSON. Try saving just a couple of rows first, with different values of `orient`, to see which one looks more appropriate. Also test how the parameter `lines` affects the output.
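
As a starting point, the sketch below serializes a two-row toy `DataFrame` to strings instead of files, so the output is easy to compare. The index labels, column names and values are assumptions for the example.

```python
import json

import pandas as pd

# Toy data with a nested structure (a list per row) in one column
sample = pd.DataFrame(
    {
        'title': ['A', 'B'],
        'genres': [['Comedy', 'Drama'], ['Drama']],
    },
    index=['tt1', 'tt2'],
)

# Without a path, to_json returns the JSON document as a string
as_records = sample.to_json(orient='records')   # list of row objects, index dropped
as_index = sample.to_json(orient='index')       # object keyed by index label

# lines=True (only valid with orient='records') writes one object per line
as_lines = sample.to_json(orient='records', lines=True)

print(as_records)
print(as_index)
print(as_lines)

# The nested lists survive the round trip through JSON
print(json.loads(as_records)[0]['genres'])
```

Passing a file name as the first argument writes the same content to disk.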