
# Introduction

I want to share how to develop an implementation, starting from a high-level problem and building a solution from scratch with increasing complexity.

## Problem
Generate a response containing the number of empty cells for each header in a CSV file stored in an S3 bucket. This process has to run in an AWS Lambda function.

# Hands on


### Create a project
I started this Python study project in the *GitHub* repository https://github.com/AlessandraMayumi/python-generator. Then I created this Colab notebook and saved a copy of it on GitHub.

### Reading a csv file with pandas
Before thinking about AWS resources, let's try to get the expected result from a local CSV file, as stated in the Problem description.

#### References
- w3schools: https://www.w3schools.com/python/pandas/pandas_csv.asp



In [None]:
import pandas as pd

df = pd.read_csv('/content/sample_data/california_housing_test.csv')
print(df.head())

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.05     37.37                27.0       3885.0           661.0   
1    -118.30     34.26                43.0       1510.0           310.0   
2    -117.81     33.78                27.0       3589.0           507.0   
3    -118.36     33.82                28.0         67.0            15.0   
4    -119.67     36.33                19.0       1241.0           244.0   

   population  households  median_income  median_house_value  
0      1537.0       606.0         6.6085            344700.0  
1       809.0       277.0         3.5990            176500.0  
2      1484.0       495.0         5.7934            270500.0  
3        49.0        11.0         6.1359            330000.0  
4       850.0       237.0         2.9375             81700.0  


### Prepare a test csv file
Generate a CSV file, then clear some of its cells at random.

#### References
- python open - https://www.pythontutorial.net/python-basics/python-read-text-file/
- csv generator: https://extendsclass.com/csv-generator.html

In [None]:
import pandas as pd
import random

# The test CSV file should have some empty cells.

filename_test = 'mock_empty_register.csv'
filepath_test = f'/content/sample_data/{filename_test}'
filepath_original = '/content/sample_data/mock_register.csv'

def modify_empty_cells():
  df = pd.read_csv(filepath_original)

  num_row, num_col = df.shape

  # Clear up to 100 random cells; the same cell may be picked more than once.
  for _ in range(100):
    x = random.randrange(num_row)
    # Start at 1 so that the first column (the id) is never cleared.
    y = random.randrange(1, num_col)
    df.loc[x, df.columns[y]] = None

  # The index is written too; it comes back as an 'Unnamed: 0' column on read.
  df.to_csv(filepath_test)

  print(f'Test csv file generated: {filename_test}')


modify_empty_cells()


Test csv file generated: mock_empty_register.csv


## Working on the solution
After the test file is generated, let's work on the solution.

### Count empty cells
Pandas already has an easy way to do it: `isna()` flags the missing cells and `sum()` counts them for each column.
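
As a quick illustration of that built-in, the sketch below counts missing cells on a tiny inline CSV. The sample data and column names are made up for this example, and it assumes pandas' default parsing, which reads an empty field as NaN.

```python
import io

import pandas as pd

# Hypothetical sample data: one empty `name` cell and one empty `email` cell.
sample_csv = io.StringIO(
    "id,name,email\n"
    "1,Ann,\n"
    "2,,bob@example.com\n"
    "3,Cy,cy@example.com\n"
)

sample_df = pd.read_csv(sample_csv)

# isna() marks each missing cell as True; sum() adds them up per column.
# int() turns the NumPy integers into plain Python ints.
sample_counts = {col: int(n) for col, n in sample_df.isna().sum().items()}
print(sample_counts)  # {'id': 0, 'name': 1, 'email': 1}
```

Note that a cell holding only whitespace is not counted as missing under these defaults.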

### Convert dataframe into dict

In [None]:
df_test = pd.read_csv(filepath_test)

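def get_empty_count(df):
  # isna() flags the empty cells and sum() counts them per column.
  empty_count = df.isna().sum().to_dict()
  # Drop the columns that are not part of the report: the id and the
  # index column that to_csv() wrote into the test file.
  empty_count.pop('id')
  empty_count.pop('Unnamed: 0')
  print(f'Successfully counted empty cells for each column: {empty_count}')
  return empty_count

response = get_empty_count(df_test)

Successfully counted empty cells for each column: {'firstname': 22, 'lastname': 18, 'email': 21, 'email2': 22, 'profession': 17}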


## Create an AWS Lambda
- Add S3 as a trigger

### Add a Test
Instead of repeatedly uploading and removing the same CSV file, print the Lambda event to learn its structure, so that a test event can mock a new file arriving in the S3 bucket.

The event JSON looks something like this:

```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "2023-07-02T19:25:52.815Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "A25Y..."
      },
      "requestParameters": {
        "sourceIPAddress": "187.106.35.106"
      },
      "responseElements": {
        "x-amz-request-id": "XKK...",
        "x-amz-id-2": "PAT..."
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "7331....",
        "bucket": {
          "name": "my-bucket",
          "ownerIdentity": {
            "principalId": "A25..."
          },
          "arn": "arn:aws:s3:::my-bucket"
        },
        "object": {
          "key": "mock_empty_register.csv",
          "size": 81685,
          "eTag": "6fc5f1...",
          "sequencer": "0064..."
        }
      }
    }
  ]
}
```
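
Only a few of those fields are needed to locate the file. As a minimal sketch, a test event can be trimmed down to the bucket name and the object key. `make_s3_put_event` is a hypothetical helper written for this example (it is not part of any AWS SDK), and the bucket and key are the values from the sample event above.

```python
import json


def make_s3_put_event(bucket, key):
    """Build a trimmed-down S3 'ObjectCreated:Put' event (hypothetical helper)."""
    return {
        "Records": [
            {
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": key},
                },
            }
        ]
    }


test_event = make_s3_put_event("my-bucket", "mock_empty_register.csv")

# The same lookups a handler would do on a real event.
record = test_event["Records"][0]
bucket_name = record["s3"]["bucket"]["name"]
object_key = record["s3"]["object"]["key"]

# The printed JSON can be saved as a test event for the Lambda.
print(json.dumps(test_event, indent=2))
```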

In [None]:
import json

def lambda_handler(event, context):
    print(f'Lambda event: {event}')

    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }


### Add permission to read S3 object

Before accessing the S3 file, it is necessary to configure the Lambda's role so that it is allowed to perform the get object operation.

Navigate through: *Configuration > Permissions*.
Then click on the Lambda role to edit it.
On the role page there is a Permissions policies section; add the permission **AmazonS3ReadOnlyAccess** there.
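
AmazonS3ReadOnlyAccess is a managed policy that is not limited to a single bucket. If a narrower grant is preferred, an inline policy restricted to `s3:GetObject` on one bucket is an option. The sketch below builds such a policy document in Python; the bucket name `my-bucket` comes from the sample event, and `build_read_policy` is a hypothetical helper for illustration only.

```python
import json


def build_read_policy(bucket):
    """Return an IAM policy document allowing GetObject on one bucket (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # GetObject applies to objects, hence the /* after the bucket ARN.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


policy = build_read_policy("my-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON could be pasted into the role as an inline policy instead of attaching the managed one.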


### Get object from S3 bucket and read as csv

Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python. To retrieve an object from S3, the handler has to parse the event, which contains the bucket name and the object key. The `Body` of the response is a `StreamingBody`; the code below reads it into bytes and wraps them in a `BytesIO` object so that pandas can read it.

#### References
- aws sdk boto3 for python: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/get_object.html

In [None]:
import boto3
import io
import pandas as pd

def lambda_handler(event, context):
    print(f'Lambda event: {event}')

    # The S3 trigger passes the bucket name and the object key in the event.
    bucket = event['Records'][0]['s3']['bucket']['name']
    object_key = event['Records'][0]['s3']['object']['key']

    client = boto3.client('s3')
    obj = client.get_object(Bucket=bucket, Key=object_key)
    # obj['Body'] is a StreamingBody: read it into memory and wrap the bytes
    # in a file-like object that pandas can parse.
    body_bytes = obj['Body'].read()
    body_io = io.BytesIO(body_bytes)

    df = pd.read_csv(body_io)
    # to_json() already returns a JSON string, so it is used as the body as-is.
    empty_count = df.isna().sum().to_json()
    print(f'Successfully counted empty cells for each column: {empty_count}')

    return {
        'statusCode': 200,
        'body': empty_count
    }
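
The handler above reads only the first record and uses the key exactly as it arrives. S3 event notifications URL-encode the object key (a space arrives as `+`), and one event may carry several records. A minimal sketch of a more defensive lookup follows; `iter_s3_objects` is a hypothetical helper name, and the sample event is made up for this example.

```python
from urllib.parse import unquote_plus


def iter_s3_objects(event):
    """Yield (bucket, key) for every record in an S3 event (hypothetical helper)."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification, so decode before calling S3.
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


# Made-up event whose key contains spaces, encoded as '+'.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "mock+empty+register.csv"},
            }
        },
    ]
}

decoded = list(iter_s3_objects(sample_event))
print(decoded)  # [('my-bucket', 'mock empty register.csv')]
```

Each decoded pair could then be passed to `client.get_object(Bucket=..., Key=...)` as in the handler.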

In [None]:
# import cProfile
# import pstats

# profiler = cProfile.Profile()
# profiler.enable()

# f()

# profiler.disable()
# pstats.Stats(profiler).print_stats()