# Prerequisites

- Create a Python virtual environment -> ```python -m venv venv```
- Activate the virtual environment -> ```. ./venv/scripts/activate``` (Windows, e.g. Git Bash) or ```. ./venv/bin/activate``` (Linux/macOS)
- Install the Python requirements from a terminal -> ```pip install -r requirements.txt```
- Obtain a Kaggle API key (from https://www.kaggle.com) and set it up by following the instructions at https://github.com/Kaggle/kaggle-api
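
Before running the notebook, it can help to confirm that the Kaggle credentials are actually in place. The sketch below checks the locations described in the kaggle-api README as far as I know them: the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables, or a `kaggle.json` file in `KAGGLE_CONFIG_DIR` or `~/.kaggle`. The function name `find_kaggle_credentials` is hypothetical, and newer client versions may also look in other directories, so treat a negative result as a hint rather than proof.

```python
import os
from pathlib import Path


def find_kaggle_credentials(env=None, home=None):
    """Return a description of where Kaggle credentials were found, or None.

    Hypothetical helper: it only checks the commonly documented locations.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment variables"

    # Option 2: kaggle.json in KAGGLE_CONFIG_DIR, or in ~/.kaggle by default.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    credentials_file = config_dir / "kaggle.json"
    if credentials_file.is_file():
        return str(credentials_file)

    return None


print(find_kaggle_credentials() or "No Kaggle credentials found; see the kaggle-api README.")
```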

# Imports

In [1]:
import kaggle as kg
import zipfile as zf
import dask.dataframe as df
import os





# Download the dataset

In [2]:
# Authenticate with the Kaggle API (requires the API key from the prerequisites)
kg.api.authenticate()

# Download the movies dataset
kg.api.dataset_download_files('rounakbanik/the-movies-dataset')

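# Unzip the dataset into the data directory; the context manager closes
# the archive before it is deleted in the next step
with zf.ZipFile('the-movies-dataset.zip') as archive:
    archive.extractall('data')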

# Remove the zip file
os.remove('the-movies-dataset.zip')
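
The unzip-and-delete step, together with a check that avoids downloading again when the files are already present, could be sketched as below. The helper names `needs_download` and `extract_and_remove` are hypothetical and not part of the Kaggle client; only the standard library is used, so the Kaggle download call itself is left out.

```python
import os
import zipfile


def needs_download(dest_dir, expected_files):
    """Return True if any expected file is missing from dest_dir."""
    return any(
        not os.path.isfile(os.path.join(dest_dir, name))
        for name in expected_files
    )


def extract_and_remove(zip_path, dest_dir):
    """Extract zip_path into dest_dir, then delete the archive.

    Returns the sorted list of member names that were extracted.
    """
    # The context manager closes the archive before os.remove runs,
    # which matters on platforms that refuse to delete open files.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = sorted(archive.namelist())
    os.remove(zip_path)
    return names
```

In this notebook, one might call `needs_download('data', ['movies_metadata.csv', 'ratings.csv'])` first and only run the download and `extract_and_remove('the-movies-dataset.zip', 'data')` when it returns `True`.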

# Read the data

### Movies Metadata
- adult: bool
- belongs_to_collection: json str
- budget: number
- genres: json str
- homepage: str
- id: number
- imdb_id: str
- original_language: str
- original_title: str
- overview: str
- popularity: number
- poster_path: str
- production_companies: json str
- production_countries: json str
- release_date: date
- revenue: number
- runtime: number
- spoken_languages: json str
- status: str
- tagline: str
- title: str
- video: bool
- vote_average: number
- vote_count: number

### Ratings

- userId: number
- movieId: number
- rating: number
- timestamp: timestamp
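
The columns marked `json str` hold lists or dicts serialized as text. In the sample rows printed by this notebook they use single-quoted Python-literal syntax (for example `{'id': 35, 'name': 'Comedy'}`), which strict JSON parsers such as `json.loads` would reject. A minimal sketch for decoding such cells, assuming `ast.literal_eval` is acceptable for this data and using the hypothetical helper names `parse_literal` and `genre_names`:

```python
import ast


def parse_literal(cell):
    """Parse a stringified list/dict cell; return None when missing or malformed."""
    if not isinstance(cell, str) or not cell.strip():
        return None
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None


def genre_names(cell):
    """Extract the 'name' values from a genres-style cell."""
    parsed = parse_literal(cell)
    if not isinstance(parsed, list):
        return []
    return [
        item["name"]
        for item in parsed
        if isinstance(item, dict) and "name" in item
    ]


print(genre_names("[{'id': 35, 'name': 'Comedy'}]"))
```

With a Dask dataframe, the helper could presumably be applied per row with something like `movies_df['genres'].map(genre_names, meta=('genres', 'object'))`; that call is an assumption about how the notebook would use it, not something the source shows.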

In [3]:
# Load both CSV files into Dask dataframes (every column is read as a string)
movies_df = df.read_csv('data/movies_metadata.csv', delimiter=',', header=0, dtype=str)

ratings_df = df.read_csv('data/ratings.csv', delimiter=',', header=0, dtype=str)



# Print the first 5 rows of each dataframe
print(movies_df.head(5))
print(ratings_df.head(5))

   adult                              belongs_to_collection    budget  \
0  False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   
1  False                                               <NA>  65000000   
2  False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   
3  False                                               <NA>  16000000   
4  False  {'id': 96871, 'name': 'Father of the Bride Col...         0   

                                              genres  \
0  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   
1  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   
2  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   
3  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   
4                     [{'id': 35, 'name': 'Comedy'}]   

                               homepage     id    imdb_id original_language  \
0  http://toystory.disney.com/toy-story    862  tt0114709                en   
1                                  <NA>   8844  tt0113497         

# TF-IDF (Term Frequency-Inverse Document Frequency)
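
As a minimal sketch of the technique named in this heading, the code below computes the plain textbook weighting, tf(t, d) * log(N / df(t)), over already-tokenized documents. This is an assumption about the variant: libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ, and the function name `tf_idf` and the toy tokens are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Returns one dict per document mapping each term to tf * idf, where
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy example: "story" occurs in both documents, so its weight is zero.
docs = [["toy", "story", "toy"], ["grumpy", "old", "story"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that appear in every document get zero weight under this definition, which is why distinctive words end up dominating each document's vector.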

# Matrix Factorization
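
One common formulation of matrix factorization for ratings learns a small vector per user and per item so that their dot product approximates the observed rating. The sketch below fits such vectors with stochastic gradient descent on `(user, item, rating)` triples using only the standard library. The function names, the hyperparameters (`n_factors`, `lr`, `reg`, `epochs`) and the toy ratings are all assumptions for illustration, not values taken from this notebook.

```python
import random


def factorize(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
              epochs=200, seed=0):
    """Fit user factors P and item factors Q to (user, item, rating) triples."""
    rng = random.Random(seed)
    # Start from small positive random values.
    P = [[rng.random() * 0.1 for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.random() * 0.1 for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))


def rmse(ratings, P, Q):
    """Root mean squared error of the predictions on the given triples."""
    squared = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5


# Toy data: (user index, item index, rating).
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = factorize(toy, n_users=3, n_items=3)
print(rmse(toy, P, Q))
```

Real `userId` and `movieId` values would first need to be mapped to contiguous zero-based indices before they could be used with this sketch.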