Avenga Internship task

Task:

Here is a dataset: "https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc?resource=download". Using Apache Spark (Scala or Python), get the following report: |-- Vendor: string (nullable = true) - name of vendor |-- Payment Type: string (nullable = true) - name of payment type |-- Payment Rate: double (nullable = true) - payment_total / passengers count per vendor and payment type |-- Next Payment Rate: double (nullable = true) - next record(bigger) payment rate for vendor |-- Max Payment Rate: double (nullable = true) - max payment rate for vendor |-- Percents to next rate: string (nullable = true) - how many points (in percents) is necessary to achieve the next record payment rate

Explanation:

Rename data in columns VendorID and payment_type.
Creating a table with payment_rate per vendor and payment type, excluding unknown vendor and not respective data as null or month that isn't July.
Creating a table with avg payment_rate per day for vendor.
Calculate growth rate of payment_rate per day: growth rate = = (payment_rate today - payment_rate yesterday)/payment_rate yesterday.
Calculate avg growth rate per day and multiply it by 30 days in month to find month grow in percent (percents_to_next_rate column).
Creating a table with avg payment_rate per vendor per month and max_payment_rate.
Adding and editing data to the resulting table.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Avenga Internship task

Files

README.md

Latest commit

History

README.md

File metadata and controls

Avenga Internship task