This project contains source code and supporting files for a serverless application that you can deploy with the SAM CLI. It includes the following files and folders.
- webcrawler - Code for crawling a website
- emailphoneextractor - Code for extracting emails and phone numbers
- events - Invocation events that you can use to invoke the function.
- template.yaml - A template that defines the application's AWS resources.
The application uses several AWS resources, including Lambda functions and a Step Function. These resources are defined in the template.yaml
file in this project. You can update the template to add AWS resources through the same deployment process that updates your application code.
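For orientation, here is a minimal sketch of how such resources are typically wired together in a SAM template. The resource names, handler paths, and runtime below are illustrative assumptions, not the exact contents of this repo's template.yaml:

```yaml
Resources:
  WebCrawlerFunction:                 # crawls the site and stores pages in S3
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: webcrawler/
      Handler: app.lambdaHandler      # handler name assumed
      Runtime: nodejs14.x

  EmailPhoneExtractorFunction:        # scans stored pages for emails/phones
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: emailphoneextractor/
      Handler: app.lambdaHandler      # handler name assumed
      Runtime: nodejs14.x

  CrawlerStateMachine:                # Step Function orchestrating the two Lambdas
    Type: AWS::Serverless::StateMachine
    Properties:
      # IAM policies allowing Lambda invocation omitted for brevity
      Definition:
        StartAt: WebCrawler
        States:
          WebCrawler:
            Type: Task
            Resource: !GetAtt WebCrawlerFunction.Arn
            Next: EmailPhoneExtractor
          EmailPhoneExtractor:
            Type: Task
            Resource: !GetAtt EmailPhoneExtractorFunction.Arn
            End: true
```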
The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. It uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application's build environment and API.
To use the SAM CLI, you need the following tools.
- SAM CLI - Install the SAM CLI
- Node.js - Install Node.js 14.x, including the NPM package management tool.
- Docker - Install Docker community edition
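You can confirm the tools are on your PATH before building:

$ sam --version
$ node --version
$ docker --version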
NOTE: Create an S3 bucket where the Lambda can store the crawled pages, either through the AWS console or with the CLI:

$ aws s3 mb s3://your-unique-bucket-name

Then update template.yaml with the bucket name:

```yaml
Parameters:
  AppBucketName:
    Type: String
    Default: your-unique-bucket-name # bucket name has to be globally unique
```
To build your application for the first time, run the following in your shell:
$ git clone https://github.com/theerakesh/lambda-web-crawler.git
$ cd lambda-web-crawler/
$ sam build
The first command clones the repository, the second changes into the project directory, and the third builds the source of your application.
Build your application with the sam build
command.
lambda-web-crawler$ sam build
The SAM CLI installs the dependencies defined in webcrawler/package.json and emailphoneextractor/package.json
, creates a deployment package, and saves it in the .aws-sam/build
folder.
Test a single function by invoking it directly with a test event. An event is a JSON document that represents the input that the function receives from the event source. Test events are included in the events
folder in this project.
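For example, a minimal test event could look like the following; the url field name is an assumption based on the domain value the handlers read, so check events/event.json for the exact shape:

```json
{
  "url": "example.com"
}
```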
Run functions locally and invoke them with the sam local invoke
command.
lambda-web-crawler$ sam local invoke WebCrawlerFunction --event events/event.json
To package and deploy your application to AWS with a series of guided prompts, run:
lambda-web-crawler$ sam deploy --guided
Once deployed, you can use the Step Function to invoke the Lambdas as follows:
- Start an execution; the execution graph shows whether each step passed or failed.
- Select a state to see its input/output, e.g. after selecting the EmailPhoneExtractor state.
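You can also start an execution from the CLI instead of the console; the ARN placeholder and the input shape below are assumptions:

$ aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input '{"url": "example.com"}'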
- The first Lambda, webcrawler, begins by checking whether the website has already been fetched by looking for a folder (key prefix) named after the domain in the S3 bucket; if one exists, it hands control straight over to the second Lambda:

```javascript
const found = await s3.listObjectsV2({
  Bucket: uploadParams.Bucket,
  Prefix: `${url}/`,
  MaxKeys: 1
}).promise()

if (found.Contents.length > 0) {
  return {
    'statusCode': 200,
    'domain': url,
    'body': `domain has already been fetched ${JSON.stringify(found)}`
  }
}
```
- Otherwise, webcrawler uses axios to fetch the website and then leverages cheerio to extract all the links in the navbar:

```javascript
const links = new Set()
const data = (await axios.get(`https://${url}`)).data
const $ = cheerio.load(data)
// using cheerio selector to collect every navbar link
$('nav a').each((i, e) => {
  links.add($(e).attr('href'))
})
```
- A simple for loop then fetches all the links asynchronously and writes their contents to the S3 bucket, keyed by domain (a sketch of the uploadParams shape follows the snippet):

```javascript
for (let i of links) {
  try {
    let response = (await axios.get(i)).data
    try {
      s3data = await s3.putObject(uploadParams).promise()
    } catch (e) {
      return { 'statusCode': 400, 'body': JSON.stringify(e) }
    }
  } catch (e) {
    return { 'statusCode': 400, 'body': JSON.stringify(e) }
  }
}
```
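uploadParams is built elsewhere in the handler. A minimal sketch of how it could be shaped for each page, assuming the bucket name comes from the same S3BucketName environment variable used by the extractor and that pages are stored under the domain prefix:

```javascript
// Hypothetical shape of uploadParams for one crawled page; the real handler builds this itself.
const uploadParams = {
  Bucket: process.env.S3BucketName,        // bucket created earlier (AppBucketName)
  Key: `${url}/${encodeURIComponent(i)}`,  // object key under the domain prefix
  Body: response                           // page HTML fetched with axios
}
```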
- The second Lambda, emailphoneextractor, first checks whether the website has already been scanned for emails and phone numbers using the DynamoDB cache:

```javascript
try {
  const dbResult = await db.get({
    TableName: tableName,
    Key: { domain: url }
  }).promise()
  if (dbResult.Item) {
    return {
      'statusCode': 200,
      'body': `emails: ${JSON.stringify(dbResult.Item.emails)}, phones: ${JSON.stringify(dbResult.Item.phones)}`
    }
  }
} catch (e) {
  return {
    'statusCode': 400,
    'body': `Problem with table ${tableName} error: ${JSON.stringify(e)}`
  }
}
```
- If no record is found, it lists all the objects in the S3 bucket under the url prefix:

```javascript
let res = (await s3.listObjectsV2({
  Bucket: process.env.S3BucketName,
  Prefix: `${url}/`
}).promise()).Contents
```
- It then scans each page one by one, using regexes to find emails and phone numbers:

```javascript
const objectData = (await s3.getObject({
  Bucket: process.env.S3BucketName,
  Key: r.Key
}).promise()).Body.toString('utf-8')
let email_matches = objectData.match(emailRegex)
let phone_matches = objectData.match(phoneRegex)
```
- Regex used for emails:

```javascript
const emailRegex = /^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/g
```
- Regex used for phone numbers:

```javascript
// Negative look-behind (?<!\d) and negative lookahead (?!\d) stop matching random 10-digit occurrences
const phoneRegex = /(?<!\d)(\+ ?\d{1,2}[\s\u00A0]?)\(?\d{3}\)?[\s.\-\u00A0]?\d{3}[\s.\-\u00A0]?\d{4}(?!\d)/g
```
- Any results found are then written to the DynamoDB table for faster access on subsequent requests (a sketch of the write follows), and the function returns them.
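Here is a minimal sketch of that write, mirroring the table and key names used in the cache lookup above; the aggregated emails and phones variables are assumed names:

```javascript
// Cache the extracted results so subsequent requests are served from DynamoDB.
// `emails` and `phones` stand for the matches aggregated across all pages.
await db.put({
  TableName: tableName,
  Item: {
    domain: url,     // partition key, matching the cache lookup above
    emails: emails,
    phones: phones
  }
}).promise()
```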
Planned improvements:
- Add DynamoDB caching
- Add unit testing
- Use AWS SQS to do recursive web-crawling
- Automate deletion of data in AWS S3 after a day or a specific period, since web content may get stale
- Implement CI/CD with GitHub Actions
- Add TypeScript support