https://www.tutorialspoint.com/how-to-use-boto3-to-start-a-crawler-in-aws-glue-data-catalog


Step 1: Import boto3 and the botocore exceptions needed to handle errors.

Step 2: The function takes a single parameter, crawler_name.

Step 3: Create an AWS session using the boto3 library. Make sure region_name is set in the default profile. If it is not, pass region_name explicitly when creating the session.

Step 4: Create an AWS client for Glue.

Step 5: Call the start_crawler function and pass crawler_name as the Name parameter.

Step 6: The call returns the response metadata and starts the crawler regardless of its schedule. If the crawler is already running, it raises CrawlerRunningException.

Step 7: Handle the generic exception in case something goes wrong while starting the crawler.

In [1]:
import getpass

In [3]:
accessKeyID = getpass.getpass()

 ····················


In [5]:
secretAccessKeyID = getpass.getpass()

 ········································


In [6]:
import boto3
from botocore.exceptions import ClientError

The following code starts an existing crawler in the AWS Glue Data Catalog.

In [8]:

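def start_a_crawler(crawler_name):
    # Build a session from the credentials entered above; the region is taken from the default profile.
    session = boto3.session.Session(aws_access_key_id=accessKeyID, aws_secret_access_key=secretAccessKeyID)
    glue_client = session.client('glue')
    try:
        # Starts the crawler regardless of its schedule.
        response = glue_client.start_crawler(Name=crawler_name)
        return response
    except ClientError as e:
        # Service-side errors, e.g. CrawlerRunningException or EntityNotFoundException.
        raise Exception("boto3 client error in start_a_crawler: " + str(e)) from e
    except Exception as e:
        raise Exception("Unexpected error in start_a_crawler: " + str(e)) from e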

# First call: start the crawler.
print(start_a_crawler("Data Dimension"))
# Second call, made before the crawler has finished its run.
print(start_a_crawler("Data Dimension"))

Exception: boto3 client error in start_a_crawler: An error occurred (EntityNotFoundException) when calling the StartCrawler operation: Crawler with name Data Dimension does not exist
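The wrapper above folds every service error into a generic Exception, so the caller cannot easily tell "crawler does not exist" from "crawler is already running". As a minimal sketch, the modeled exceptions that a boto3 client exposes under `client.exceptions` can be caught individually. The function name `start_crawler_safely` and the returned status strings are assumptions made for illustration; they are not part of boto3 or of the original tutorial.

```python
def start_crawler_safely(glue_client, crawler_name):
    """Start a crawler and report the outcome as a short status string.

    glue_client is expected to be a boto3 Glue client, for example the
    result of session.client('glue'). The status strings are hypothetical.
    """
    try:
        glue_client.start_crawler(Name=crawler_name)
        return "started"
    except glue_client.exceptions.CrawlerRunningException:
        # The crawler is already in the middle of a run.
        return "already_running"
    except glue_client.exceptions.EntityNotFoundException:
        # No crawler with this name exists in the client's account and region.
        return "not_found"


# Example call, assuming a Glue client has already been created:
# print(start_crawler_safely(glue_client, "Data Dimension"))
```

Any other error still propagates unchanged, which keeps unexpected failures visible instead of hiding them behind a status string.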

In [11]:
session = boto3.session.Session(aws_access_key_id=accessKeyID, aws_secret_access_key=secretAccessKeyID)
glue_client = session.client('glue', region_name='us-east-1')

https://hands-on.cloud/working-with-aws-glue-in-python-using-boto3/#h-creating-an-aws-glue-crawler


In [13]:
import json

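# NOTE: the 'string' and 123 values below are the placeholders from the boto3
# request-syntax template; replace them with real settings before running.
response = glue_client.create_crawler(Name='CrawlerAZ1',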
                                      Role='GlueFullAccess',
                                      DatabaseName='myGlueDb1',
                                      Targets={ 
                                          'S3Targets': [
                                              {
                                                  'Path': 'string',
                                                  'Exclusions': [
                                                      'string',
                                                  ],
                                                  'ConnectionName': 'string',
                                                  'SampleSize': 123,
                                                  'EventQueueArn': 'string',
                                                  'DlqEventQueueArn': 'string'
                                              },
                                          ],
                                      },
                                      Schedule='cron(15 12 * * ? *)',
                                      SchemaChangePolicy={
                                          'UpdateBehavior': 'UPDATE_IN_DATABASE',
                                          'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
                                      },
                                      RecrawlPolicy={
                                          'RecrawlBehavior': 'CRAWL_EVERYTHING'
                                      },
                                      LineageConfiguration={
                                          'CrawlerLineageSettings': 'DISABLE'
                                      })

InvalidInputException: An error occurred (InvalidInputException) when calling the CreateCrawler operation: Service is unable to assume role arn:aws:iam::877061436404:role/GlueFullAccess. Please verify role's TrustPolicy
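The error above means the Glue service is not allowed to assume the role, which is controlled by the role's trust policy. A minimal sketch of a trust policy that names the Glue service principal (`glue.amazonaws.com`) is shown here. The helper name `glue_trust_policy` is an assumption for illustration, and the commented-out IAM call presumes the caller has permission to edit the role; whether this alone resolves the error for the role used above has not been verified.

```python
import json


def glue_trust_policy():
    """Return a trust policy document that lets AWS Glue assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# IAM expects the policy as a JSON string.
policy_json = json.dumps(glue_trust_policy(), indent=2)
print(policy_json)

# Applying it requires IAM permissions, so it is left commented out:
# iam_client = session.client('iam')
# iam_client.update_assume_role_policy(RoleName='GlueFullAccess',
#                                      PolicyDocument=policy_json)
```

The role also needs permissions policies for Glue and for the S3 paths being crawled; the trust policy only governs who may assume the role.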

In [12]:
??glue_client.create_crawler

Signature: glue_client.create_crawler(*args, **kwargs)
Docstring:
Creates a new crawler with specified targets, role, configuration, and optional schedule. At least one crawl target must be specified, in the ``s3Targets`` field, the ``jdbcTargets`` field, or the ``DynamoDBTargets`` field.



See also: `AWS API Documentation <https://docs.aws.amazon.com/goto/WebAPI/glue-2017-03-31/CreateCrawler>`_


**Request Syntax** 
::

  response = client.create_crawler(
      Name='string',
      Role='string',
      DatabaseName='string',
      Description='string',
      Targets={
          'S3Targets': [
              {
                  'Path': 'string',
                  'Exclusions': [
                      'string',
                  ],
                  'ConnectionName': 'string',
                  'SampleSize': 123,
                  'EventQueueA