This project contains source code and supporting files for a serverless application that you can deploy with the SAM CLI. It includes the following files and folders.
- webcrawler - Code for crawling a website
- emailphoneextractor - Code for extracting emails and phone numbers
- events - Invocation events that you can use to invoke the function.
- template.yaml - A template that defines the application's AWS resources.
The application uses several AWS resources, including Lambda functions and a Step Function. These resources are defined in the template.yaml
file in this project. You can update the template to add AWS resources through the same deployment process that updates your application code.
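For orientation, here is a minimal sketch of how such resources are typically wired together in a SAM template. The resource names, handler paths, and runtime below are illustrative assumptions, not the exact contents of this repo's template.yaml:

```yaml
Resources:
  WebCrawlerFunction:                 # crawls the site and stores pages in S3
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: webcrawler/
      Handler: app.lambdaHandler      # handler name assumed
      Runtime: nodejs14.x

  EmailPhoneExtractorFunction:        # scans stored pages for emails/phones
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: emailphoneextractor/
      Handler: app.lambdaHandler      # handler name assumed
      Runtime: nodejs14.x

  CrawlerStateMachine:                # Step Function orchestrating the two Lambdas
    Type: AWS::Serverless::StateMachine
    Properties:
      # IAM policies allowing Lambda invocation omitted for brevity
      Definition:
        StartAt: WebCrawler
        States:
          WebCrawler:
            Type: Task
            Resource: !GetAtt WebCrawlerFunction.Arn
            Next: EmailPhoneExtractor
          EmailPhoneExtractor:
            Type: Task
            Resource: !GetAtt EmailPhoneExtractorFunction.Arn
            End: true
```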
The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. It uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application's build environment and API.
To use the SAM CLI, you need the following tools.
- SAM CLI - Install the SAM CLI
- Node.js - Install Node.js 14.x, including the NPM package management tool.
- Docker - Install Docker community edition
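You can confirm the tools are on your PATH before building:

$ sam --version
$ node --version
$ docker --version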
NOTE: Create an S3 bucket where the Lambda can store the crawled pages, either through the AWS console or with the CLI:

$ aws s3 mb s3://your-unique-bucket-name

Then update template.yaml with the bucket name:

```yaml
Parameters:
  AppBucketName:
    Type: String
    Default: your-unique-bucket-name # bucket name has to be globally unique
```
To build your application for the first time, run the following in your shell:
$ git clone https://github.com/theerakesh/lambda-web-crawler.git
$ cd lambda-web-crawler/
$ sam build
The first command clones the repository, the second changes into the project directory, and the third builds the source of your application.
Build your application with the sam build
command.
lambda-web-crawler$ sam build
The SAM CLI installs the dependencies defined in webcrawler/package.json and emailphoneextractor/package.json
, creates a deployment package, and saves it in the .aws-sam/build
folder.
Test a single function by invoking it directly with a test event. An event is a JSON document that represents the input that the function receives from the event source. Test events are included in the events
folder in this project.
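For example, a minimal test event could look like the following; the url field name is an assumption based on the domain value the handlers read, so check events/event.json for the exact shape:

```json
{
  "url": "example.com"
}
```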
Run functions locally and invoke them with the sam local invoke
command.
lambda-web-crawler$ sam local invoke WebCrawlerFunction --event events/event.json
To package and deploy your application to AWS with a series of guided prompts, run:
lambda-web-crawler$ sam deploy --guided
Once deployed, you can use the Step Function to invoke the Lambdas as follows:
- Start an execution; the execution graph shows whether each step passed or failed.
- Select a state to see its input/output, e.g. after selecting the EmailPhoneExtractor state.
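You can also start an execution from the CLI instead of the console; the ARN placeholder and the input shape below are assumptions:

$ aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input '{"url": "example.com"}'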
- The first Lambda, webcrawler, begins by checking whether the website has already been fetched by looking for a folder (key prefix) named after the domain in the S3 bucket; if one exists, it hands control straight over to the second Lambda:

```javascript
const found = await s3.listObjectsV2({
  Bucket: uploadParams.Bucket,
  Prefix: `${url}/`,
  MaxKeys: 1
}).promise()

if (found.Contents.length > 0) {
  return {
    'statusCode': 200,
    'domain': url,
    'body': `domain has already been fetched ${JSON.stringify(found)}`
  }
}
```
- Otherwise, webcrawler uses axios to fetch the website and then leverages cheerio to extract all the links in the navbar:

```javascript
const links = new Set()
const data = (await axios.get(`https://${url}`)).data
const $ = cheerio.load(data)
// using cheerio selector to collect every navbar link
$('nav a').each((i, e) => {
  links.add($(e).attr('href'))
})
```
- A simple for loop then fetches all the links asynchronously and writes their contents to the S3 bucket, keyed by domain (a sketch of the uploadParams shape follows the snippet):

```javascript
for (let i of links) {
  try {
    let response = (await axios.get(i)).data
    try {
      s3data = await s3.putObject(uploadParams).promise()
    } catch (e) {
      return { 'statusCode': 400, 'body': JSON.stringify(e) }
    }
  } catch (e) {
    return { 'statusCode': 400, 'body': JSON.stringify(e) }
  }
}
```
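uploadParams is built elsewhere in the handler. A minimal sketch of how it could be shaped for each page, assuming the bucket name comes from the same S3BucketName environment variable used by the extractor and that pages are stored under the domain prefix:

```javascript
// Hypothetical shape of uploadParams for one crawled page; the real handler builds this itself.
const uploadParams = {
  Bucket: process.env.S3BucketName,        // bucket created earlier (AppBucketName)
  Key: `${url}/${encodeURIComponent(i)}`,  // object key under the domain prefix
  Body: response                           // page HTML fetched with axios
}
```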
- The second Lambda, emailphoneextractor, first checks whether the website has already been scanned for emails and phone numbers using the DynamoDB cache:

```javascript
try {
  const dbResult = await db.get({
    TableName: tableName,
    Key: { domain: url }
  }).promise()
  if (dbResult.Item) {
    return {
      'statusCode': 200,
      'body': `emails: ${JSON.stringify(dbResult.Item.emails)}, phones: ${JSON.stringify(dbResult.Item.phones)}`
    }
  }
} catch (e) {
  return {
    'statusCode': 400,
    'body': `Problem with table ${tableName} error: ${JSON.stringify(e)}`
  }
}
```
- If no record is found, it lists all the objects in the S3 bucket under the url prefix:

```javascript
let res = (await s3.listObjectsV2({
  Bucket: process.env.S3BucketName,
  Prefix: `${url}/`
}).promise()).Contents
```
- It then scans each page one by one, using regexes to find emails and phone numbers:

```javascript
const objectData = (await s3.getObject({
  Bucket: process.env.S3BucketName,
  Key: r.Key
}).promise()).Body.toString('utf-8')
let email_matches = objectData.match(emailRegex)
let phone_matches = objectData.match(phoneRegex)
```
- Regex used for emails:

```javascript
const emailRegex = /^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/g
```
- Regex used for phone numbers:

```javascript
// Negative look-behind (?<!\d) and negative lookahead (?!\d) stop matching random 10-digit occurrences
const phoneRegex = /(?<!\d)(\+ ?\d{1,2}[\s\u00A0]?)\(?\d{3}\)?[\s.\-\u00A0]?\d{3}[\s.\-\u00A0]?\d{4}(?!\d)/g
```
- Any results found are then written to the DynamoDB table for faster access on subsequent requests (a sketch of the write follows), and the function returns them.
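Here is a minimal sketch of that write, mirroring the table and key names used in the cache lookup above; the aggregated emails and phones variables are assumed names:

```javascript
// Cache the extracted results so subsequent requests are served from DynamoDB.
// `emails` and `phones` stand for the matches aggregated across all pages.
await db.put({
  TableName: tableName,
  Item: {
    domain: url,     // partition key, matching the cache lookup above
    emails: emails,
    phones: phones
  }
}).promise()
```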
Planned improvements:
- Add DynamoDB caching
- Add unit testing
- Use AWS SQS to do recursive web-crawling
- Automate deletion of data in AWS S3 after a day or a specific period, since web content may get stale
- Implement CI/CD with GitHub Actions
- Add TypeScript support