This project is designed to parse CSV files containing taxi trip data, identify and handle duplicate records, and bulk insert the valid records into an MS SQL database. The project leverages dependency injection for service management and follows best practices for configuration and error handling.
The database is hosted as an Amazon RDS SQL Server instance.
The project is organized into the following folders:
- Common: Contains common utilities and the DI (Dependency Injection) extension.
- Helpers: Provides helper methods for CSV parsing.
- Interfaces: Defines the interfaces for services used in the project.
- Models: Contains the data models.
- Services: Implements the services for CSV parsing, writing, and database operations.
The project depends on the following NuGet packages:
- CsvHelper
- Microsoft.Extensions.Configuration
- Microsoft.Extensions.DependencyInjection
- System.Data.SqlClient
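The DI extension in `Common` wires these packages together. A minimal sketch of how the registration might look, assuming hypothetical interface and service names (`ICsvParserService`, `CsvParserService`, etc. are illustrative placeholders, not taken from the project):

```csharp
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Placeholder abstractions standing in for the project's real
// Interfaces/Services types; the actual names may differ.
public interface ICsvParserService { }
public class CsvParserService : ICsvParserService { }
public interface IDatabaseService { }
public class DatabaseService : IDatabaseService { }

public static class ServiceCollectionExtensions
{
    // Hypothetical DI extension: registers the configuration and the
    // CSV/database services into the container.
    public static IServiceCollection AddTaxiTripServices(
        this IServiceCollection services, IConfiguration configuration)
    {
        services.AddSingleton(configuration);
        services.AddSingleton<ICsvParserService, CsvParserService>();
        services.AddSingleton<IDatabaseService, DatabaseService>();
        return services;
    }
}
```

Consumers then resolve services from the built `ServiceProvider` rather than constructing them directly.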
The project expects a configuration file `appsettings.json` with the following structure:

```json
{
  "environmentVariables": {
    "ConnectionString": "your-database-connection-string",
    "csvUrl": "path-to-your-csv-file",
    "duplicateFilePath": "path-to-save-duplicates.csv"
  }
}
```
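These values can be loaded with Microsoft.Extensions.Configuration (the `AddJsonFile` call requires the Microsoft.Extensions.Configuration.Json package). A minimal sketch, with key names matching the JSON above:

```csharp
using System;
using Microsoft.Extensions.Configuration;

var configuration = new ConfigurationBuilder()
    .SetBasePath(AppContext.BaseDirectory)
    .AddJsonFile("appsettings.json", optional: false)
    .Build();

// Nested keys are addressed with a colon-separated path.
string? connectionString = configuration["environmentVariables:ConnectionString"];
string? csvUrl = configuration["environmentVariables:csvUrl"];
string? duplicateFilePath = configuration["environmentVariables:duplicateFilePath"];
```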
The table was created with the following command:

```sql
CREATE TABLE TaxiTrips (
    Id INT IDENTITY(1,1) PRIMARY KEY,
    TpepPickupDatetime DATETIME NOT NULL,
    TpepDropoffDatetime DATETIME NOT NULL,
    PassengerCount INT NOT NULL,
    TripDistance DECIMAL(10, 2) NOT NULL,
    StoreAndFwdFlag NVARCHAR(3) NOT NULL,
    PULocationID INT NOT NULL,
    DOLocationID INT NOT NULL,
    FareAmount DECIMAL(10, 2) NOT NULL,
    TipAmount DECIMAL(10, 2) NOT NULL
);
```
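Bulk insertion into this table can be done with `SqlBulkCopy` from System.Data.SqlClient. A sketch, assuming the parsed records have been staged in a `DataTable` whose columns mirror the schema above (the connection string placeholder stands in for the value from `appsettings.json`):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

// Stand-in for the value read from configuration.
string connectionString = "your-database-connection-string";

// Columns mirror the TaxiTrips schema; Id is omitted because
// it is an IDENTITY column generated by the server.
var table = new DataTable("TaxiTrips");
table.Columns.Add("TpepPickupDatetime", typeof(DateTime));
table.Columns.Add("TpepDropoffDatetime", typeof(DateTime));
table.Columns.Add("PassengerCount", typeof(int));
table.Columns.Add("TripDistance", typeof(decimal));
table.Columns.Add("StoreAndFwdFlag", typeof(string));
table.Columns.Add("PULocationID", typeof(int));
table.Columns.Add("DOLocationID", typeof(int));
table.Columns.Add("FareAmount", typeof(decimal));
table.Columns.Add("TipAmount", typeof(decimal));

using var connection = new SqlConnection(connectionString);
connection.Open();
using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "TaxiTrips",
    BatchSize = 10_000
};
// Map source columns to destination columns by name so column
// order in the DataTable does not matter.
foreach (DataColumn column in table.Columns)
    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
bulkCopy.WriteToServer(table);
```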
- Ensure `appsettings.json` is properly configured.
- Place the CSV file in the build output directory (`bin/Debug/net8.0`).
- The connection string will be sent by e-mail.
- Build the project (or use Docker).
- Run the project.
If we had to work with a 10 GB file, or other really large datasets, we might want to look into distributed processing frameworks such as Apache Spark or Hadoop. These tools are designed for big data and can process large files more efficiently than a single machine.
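Before reaching for a distributed framework, a single machine can often cope with a file of that size by streaming it in fixed-size batches instead of loading it into memory. A minimal sketch, assuming a hypothetical file name and a placeholder batch handler:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

const int BatchSize = 50_000;
var batch = new List<string>(BatchSize);

using var reader = new StreamReader("trips.csv"); // hypothetical file name
reader.ReadLine();                                // skip the header row
while (reader.ReadLine() is { } line)
{
    batch.Add(line);
    if (batch.Count == BatchSize)
    {
        // Parse and bulk-insert this batch, then release the memory.
        ProcessBatch(batch);
        batch.Clear();
    }
}
if (batch.Count > 0)
    ProcessBatch(batch); // flush the final partial batch

static void ProcessBatch(List<string> rows)
{
    // Placeholder: parse the rows and hand them to the bulk-insert service.
    Console.WriteLine($"Processed {rows.Count} rows");
}
```

Memory usage stays bounded by the batch size regardless of the file size, which pairs naturally with `SqlBulkCopy`'s batched writes.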