# Data Collection
In this lab we will take a quick look at some simple data collection techniques.

## Data Collection

Ever thought about grabbing some cool data for your project? Sure, you could just download a ready-made dataset that fits your needs. But let's be real: sometimes what you need is as unique as a unicorn in a field of horses.

So, what do you do when the perfect dataset is playing hard to get? You have two awesome choices:

1. **Become a Data Detective with APIs**: Think of an API as your personal data assistant. It's like saying, "Hey API, can you fetch me some data?" And voila, it gets the job done.

2. **DIY Dataset Creation**: Roll up your sleeves and create your dataset masterpiece. How? By using techniques such as web scraping or crawling. It's like going on a treasure hunt on the internet!

### API
Use “**The Movie DB**” API to:
   1. Download data about movies.
   2. Search for movies in the “Comedy” genre released in the year 2000 or later. Retrieve the 300 most popular movies in this genre. The movies should be sorted from most popular to least popular. Hint: Sorting based on popularity can be done in the API call.
   3. For each comedy movie, download its first 5 similar movies. If a movie has fewer than 5 similar movies, the API will return as many as it can find, so your code should handle however many movies come back.
    
For more information on retrieving movie data, see the following pages of The Movie DB API documentation:
   - [Movie Discover](https://developers.themoviedb.org/3/discover/movie-discover)
   - [Get Movie List](https://developers.themoviedb.org/3/genres/get-movie-list)
   - [Get Similar Movies](https://developers.themoviedb.org/3/movies/get-similar-movies)
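
As a rough sketch of task 3, assuming the similar-movies response is a JSON object whose `results` list holds dictionaries with an `id` field, the first five similar movies could be paired with the source movie as follows. The helper name `similar_pairs` is illustrative and not part of the API.

```python
from typing import Iterable, List, Tuple


def similar_pairs(
    movie_id: int, similar_results: Iterable[dict], max_similar: int = 5
) -> List[Tuple[int, int]]:
    """Pair movie_id with the IDs of at most max_similar similar movies."""
    pairs = []
    for item in similar_results:
        if len(pairs) == max_similar:
            break  # keep only the first max_similar entries
        pairs.append((movie_id, item["id"]))
    return pairs
```

Because the loop simply stops when the input runs out, a movie with fewer than five similar movies yields a shorter list and needs no special case.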

#### Saving Results
   - **File Format**: For the comedy movies, save the results in `movie_ID_name.csv`. For the similar movies of each comedy movie, save the results in `movie_ID_sim_movie_ID.csv`.
   - **Format Specification**: For the comedy movies, each line should describe one movie in the format `movie-ID,movie-name`, with no space after the comma and no column headers. <br/>**Example**: A line in the file could look like `353486,Jumanji: Welcome to the Jungle`.<br/> For the similar movies, each line in the file should describe one pair of similar movies in the format `movie-ID,similar-movie-ID`, with no space after the comma and no column headers.<br/> **Example**: If `Jumanji: Welcome to the Jungle`, which has ID `353486`, has 3 similar movies with IDs `A`, `B`, and `C`, then the following lines should be added to `movie_ID_sim_movie_ID.csv`:
        - `353486,A`
        - `353486,B`
        - `353486,C`
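
One possible way to produce files in this shape is Python's built-in `csv` module. This is a minimal sketch, and the helper name `write_rows` is an assumption made for illustration. Note that `csv.writer` quotes any field that itself contains a comma, which keeps the file parseable when a movie title includes one.

```python
import csv
from typing import Iterable, Sequence


def write_rows(path: str, rows: Iterable[Sequence[object]]) -> None:
    """Write rows to path as comma-separated lines without a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerows(rows)
```

For example, `write_rows("movie_ID_name.csv", [(353486, "Jumanji: Welcome to the Jungle")])` writes the single line shown in the example above.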

#### Notes
   - **Multiple API Calls**: The API paginates its results, so you may need several calls, one per page, to retrieve all 300 movies.
   - **API Parameters**: To select movies released in 2000 or later, filter on `primary_release_date` (for example `primary_release_date.gte`) rather than `release_date`, which can return incorrect results.
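
The two notes above could be combined roughly as follows. This is a sketch under stated assumptions: the response is taken to be a JSON object with `results` and `total_pages` fields, and `discover_params`, `collect_movies`, and the injected `fetch_page` callable are illustrative names, not part of the API. The network call is passed in as a function so the paging logic can be tried without hitting the API.

```python
from typing import Callable, Dict, List


def discover_params(api_key: str, genre_id: int, page: int) -> Dict[str, object]:
    """Query parameters for one page of a discover request."""
    return {
        "api_key": api_key,
        "with_genres": genre_id,
        "primary_release_date.gte": "2000-01-01",  # released in 2000 or later
        "sort_by": "popularity.desc",  # most popular first
        "page": page,
    }


def collect_movies(fetch_page: Callable[[int], dict], limit: int = 300) -> List[dict]:
    """Request pages 1, 2, ... until limit movies are collected or pages run out."""
    movies: List[dict] = []
    page = 1
    while len(movies) < limit:
        data = fetch_page(page)
        results = data.get("results", [])
        if not results:
            break  # empty page: nothing more to read
        movies.extend(results)
        if page >= data.get("total_pages", page):
            break  # that was the last page
        page += 1
    return movies[:limit]
```

In the notebook, `fetch_page` would be a small function that calls `requests.get(f"{base_url}/discover/movie", params=discover_params(api_key, genre_id, page))` and returns the parsed JSON; the genre ID for Comedy comes from the genre list endpoint.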

#### Deliverables
   - **movie_ID_name.csv**: The text file that contains the output for task 2.
   - **movie_ID_sim_movie_ID.csv**: The text file that contains the output for task 3.


You will need an API key to get data from TMDb. For the purposes of this lab, we will simply include it in our notebook; however, you should **NOT** do this for anything other than personal projects saved locally. Best practice is to insert the API key at runtime, for example via a command-line argument or an environment variable.
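
For reference, reading the key from an environment variable might look like the sketch below. The variable name `TMDB_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration; any name works as long as it is set before the notebook starts.

```python
import os


def load_api_key(env_var: str = "TMDB_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, the notebook would use `api_key = load_api_key()` instead of a hard-coded string, so the key never ends up in a shared file.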

#### How to Use TheMovieDB API
- **Create an Account**: Sign up at [https://www.themoviedb.org/account/signup](https://www.themoviedb.org/account/signup).
- **Request API Key**:
  1. Log in and go to **Settings**.
  2. Navigate to the **API** tab in the left panel.
  3. Request an API key by selecting “Developer” and accepting the terms.
  4. Fill out the form.
  5. Your API key will be available under the API tab.

##### Important Notes
- **API Documentation**: Refer to [TheMovieDB API Documentation](https://developers.themoviedb.org/3/getting-started/introduction) for guidance.
- **Rate Limiting**: The API allows 40 requests every 10 seconds. Add suitable pauses between requests in your code so that you stay under this limit.
- **Variable Results**: The API may return different results for the same request. Plan your script's run time accordingly.
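
One way to respect the rate limit stated above is a small sliding-window throttle. This is a sketch rather than a required design: the `Throttle` class is hypothetical, and the `clock` and `sleep` arguments exist only so the logic can be exercised without real waiting.

```python
import time
from collections import deque


class Throttle:
    """Allow at most max_calls calls in any window of period seconds."""

    def __init__(self, max_calls=40, period=10.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # start times of recent calls, oldest first

    def wait(self):
        """Block until one more call fits in the window, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            # Window is full: sleep until the oldest call expires.
            self._sleep(self.period - (now - self._stamps[0]))
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)
```

Create one `Throttle()` for the whole script and call its `wait()` method immediately before each `requests.get(...)`.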


In [1]:
# !pip install requests beautifulsoup4  # csv ships with Python's standard library
import requests
import csv

In [2]:
api_key = 'YOUR_API_KEY' # Replace 'YOUR_API_KEY' with your actual API key
base_url = 'https://api.themoviedb.org/3'

In [3]:
# your code here

## Web Scraping

- **Target Website**: Our goal is to scrape job listings and their details from [Fake Jobs](https://realpython.github.io/fake-jobs/).
- **Initial Step**: Start by opening the website in your browser to familiarize yourself with its layout and content.
- **Understanding HTML Structure**:
  - In Chrome, to inspect the page's HTML structure, open the browser menu (the three-dot icon) and go to:
    - `More Tools` -> `Developer Tools`.
  - This will help in planning our scraping strategy effectively.
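
Once the structure is known, extracting fields with Beautiful Soup could look like the sketch below. It runs on a small inline HTML snippet rather than the live site, and the tag and class names (`card-content`, `title`, `company`, `location`) are assumptions that should be checked against what Developer Tools shows for the real page. The helper name `parse_jobs` is also illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the real page may use different tags or class names.
SAMPLE_HTML = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title">Example Engineer</h2>
    <h3 class="company">Example Co</h3>
    <p class="location">Example City</p>
  </div>
</div>
"""


def parse_jobs(html):
    """Return one dict per job card found in html."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="card-content"):
        title = card.find("h2", class_="title")
        company = card.find("h3", class_="company")
        location = card.find("p", class_="location")
        jobs.append({
            # Fall back to an empty string when an element is missing.
            "title": title.get_text(strip=True) if title else "",
            "company": company.get_text(strip=True) if company else "",
            "location": location.get_text(strip=True) if location else "",
        })
    return jobs
```

Guarding each lookup with `if ... else ""` keeps the loop from failing on a card that lacks one of the expected elements.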


Here, we set up our web scraping environment by importing necessary libraries and defining the URL of the website we want to scrape.

In [4]:
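import requests
from bs4 import BeautifulSoup

# Page that lists the fake job postings
URL = "https://realpython.github.io/fake-jobs/"
# Download the page and stop early if the server reports an HTTP error
page = requests.get(URL)
page.raise_for_status()
# Parse the raw HTML with Python's built-in parser
soup = BeautifulSoup(page.content, "html.parser")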

In [5]:
# your code here