# Building Fast Queries on a CSV

Our goal is to build a way to answer a few business questions about our inventory. We will use the "laptops.csv" file as our inventory. This CSV file was adapted from the [Laptop Prices dataset on Kaggle](https://www.kaggle.com/datasets/ionaskel/laptop-prices). We changed the IDs and made the prices integers.

Here is a brief description of the columns:

- ID: A unique identifier for the laptop.
- Company: The name of the company that produces the laptop.
- Product: The name of the laptop.
- TypeName: The type of laptop.
- Inches: The size of the screen in inches.
- ScreenResolution: The resolution of the screen.
- CPU: The laptop CPU.
- RAM: The amount of RAM in the laptop.
- Memory: The size of the hard drive.
- GPU: The graphics card name.
- OpSys: The name of the operating system.
- Weight: The laptop weight.
- Price: The price of the laptop.

## Reading the file

In [1]:
# Reading file

import csv

with open("laptops.csv", encoding = "UTF-8") as file : 
    file = list(csv.reader(file))
    header = file[0]
    rows = file[1:]

print(header)
print("\n")
print(rows[:5])

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


[['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339'], ['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898'], ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575'], ['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537'], ['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256G

The goal of this project is to create a class that represents our inventory. The methods in that class will implement the queries that we want to answer about our inventory. Here are some queries that we will want to answer:

- Given a laptop ID, find the corresponding data.
- Given an amount of money, find whether there are two laptops whose total price is exactly that amount.
- Identify all laptops whose price falls within a given budget.

## Creating the Inventory class

In [2]:
class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])

In [3]:
inventory = Inventory("laptops.csv")
print(inventory.header)
print("\n")
print(f"Length of dataset : {len(inventory.rows)}")

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


Length of dataset : 1303


## Looking up a laptop by its ID

The first thing that we will implement is a way to look up a laptop from a given identifier. This way, when a customer comes to our store with a purchase slip, we can quickly identify the laptop to which it corresponds. For this, we will write a method get_laptop_from_id(). This method takes the identifier of the laptop as its argument and returns the full row of the laptop with that ID, or None if no such laptop exists.

In [4]:
class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None

In [5]:
inventory = Inventory("laptops.csv")
print(f"The result of query '3362737' : {inventory.get_laptop_from_id('3362737')}")
print(f"The result of query '3362736' : {inventory.get_laptop_from_id('3362736')}")

The result of query '3362737' : ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
The result of query '3362736' : None


In the worst case, this algorithm has to look at every single row to find the one that we are looking for, so it has time complexity O(N), where N is the number of rows. However, by using a set, we could check in constant time whether a given identifier exists.

Because we also need to return the data for that identifier, we will use a dictionary instead of a set. The idea is to preprocess the data into a dictionary where the keys are the IDs and the values are the remaining columns of each row.
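As a minimal sketch of this preprocessing idea, the snippet below builds such a dictionary from a few made-up rows and looks IDs up with `dict.get`, which returns `None` for a missing key. The `toy_rows` data and the `lookup` helper are illustrative assumptions, not part of the inventory file or the class.

```python
# Toy rows in the same shape as the CSV: ID first, price last (made-up values).
toy_rows = [
    ["1000001", "BrandA", "Model X", 900],
    ["1000002", "BrandB", "Model Y", 1200],
    ["1000003", "BrandA", "Model Z", 650],
]

# Preprocess once: map each ID to the remaining columns of its row.
id_to_row = {row[0]: row[1:] for row in toy_rows}

def lookup(laptop_id):
    # dict.get returns None when the key is absent, so no explicit check is needed.
    return id_to_row.get(laptop_id)

print(lookup("1000002"))  # ['BrandB', 'Model Y', 1200]
print(lookup("9999999"))  # None
```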

In [6]:
class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None

In [7]:
inventory = Inventory("laptops.csv")
print(f"The result of query '3362737' : {inventory.get_laptop_from_id_fast('3362737')}")
print(f"The result of query '3362736' : {inventory.get_laptop_from_id_fast('3362736')}")

The result of query '3362737' : ['HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
The result of query '3362736' : None


## Comparing the execution time of the two lookup methods

The get_laptop_from_id() method has time complexity O(N), where N is the number of rows. In contrast, the new implementation has time complexity O(1) on average. It achieves this by using more memory to store the self.id_to_row dictionary and by spending a bit more time when creating an instance. Let's run an experiment to compare the performance of the two methods.

In [8]:
import time
import random 

ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]

class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None  

In [9]:
inventory = Inventory("laptops.csv")
total_time_no_dict = 0    # This variable will aggregate the times of calling get_laptop_from_id()

for lap_id in ids : 
    start = time.time()
    inventory.get_laptop_from_id(lap_id)
    end = time.time()
    total_time_no_dict += end-start 
    
total_time_dict = 0    # This variable will aggregate the times of calling get_laptop_from_id_fast()

for lap_id in ids : 
    start = time.time()
    inventory.get_laptop_from_id_fast(lap_id)
    end = time.time()
    total_time_dict += end-start

In [10]:
print(f"Total time of get_laptop_from_id() : {total_time_no_dict}")
print(f"Total time of get_laptop_from_id_fast() : {total_time_dict}")

Total time of get_laptop_from_id() : 3.390949249267578
Total time of get_laptop_from_id_fast() : 0.004830121994018555


In this run, get_laptop_from_id_fast() is about 700 times faster than get_laptop_from_id() (roughly 0.0048 seconds versus 3.39 seconds for 10,000 lookups).
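Another way to take this kind of measurement, sketched here on synthetic data rather than on laptops.csv, is `time.perf_counter()`, a clock intended for measuring short durations. The `rows`, `scan_lookup`, `dict_lookup`, and `total_time` names below are illustrative assumptions; the timings depend on the machine, so no expected output is shown.

```python
import random
import time

# Synthetic data: 2,000 made-up rows with unique string IDs (not the real inventory).
rows = [[str(1000000 + i), f"laptop-{i}", 500 + i] for i in range(2000)]
id_to_row = {row[0]: row for row in rows}

def scan_lookup(laptop_id):
    # Linear scan: compares the ID column of each row until a match is found.
    for row in rows:
        if row[0] == laptop_id:
            return row
    return None

def dict_lookup(laptop_id):
    # Hash lookup: average constant time regardless of the number of rows.
    return id_to_row.get(laptop_id)

def total_time(func, queries):
    # Time the whole batch with a single pair of perf_counter() calls.
    start = time.perf_counter()
    for query in queries:
        func(query)
    return time.perf_counter() - start

random.seed(0)
# About half of these IDs exist in the synthetic data, the rest are misses.
queries = [str(random.randint(1000000, 1003999)) for _ in range(1000)]

print(f"scan: {total_time(scan_lookup, queries):.6f} s")
print(f"dict: {total_time(dict_lookup, queries):.6f} s")
```

Timing the whole batch at once, instead of each call separately, keeps the overhead of the clock calls out of the measurement.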

## Spending a gift card in full on up to two laptops

Sometimes, our store offers a promotion where we give out a gift card. A customer can use the gift card to buy up to two laptops. To avoid having to keep track of what was already spent, the gift card can be used only once. This means that, even if there is leftover money, it cannot be used anymore.

Customers might feel cheated if, no matter how they spend their gift card, they cannot spend its full value. We don't want that, so whenever we issue a gift card, we want to make sure that there is at least one way to spend it in full. The check_promotion_dollars() method below returns True if some laptop costs exactly the given amount, or if two laptop prices add up to it.

In [11]:
class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None  
    
    def check_promotion_dollars(self, dollars) : 
        for row in self.rows : 
            if dollars == row[-1] : 
                return True 
        for row1 in self.rows : 
            for row2 in self.rows : 
                if row1[-1] + row2[-1] == dollars :
                    return True 
        return False 

In [12]:
inventory = Inventory('laptops.csv')
print(f"Can a 1000-dollar gift card be spent in full? : {inventory.check_promotion_dollars(1000)}")
print(f"Can a 442-dollar gift card be spent in full? : {inventory.check_promotion_dollars(442)}")

Can a 1000-dollar gift card be spent in full? : True
Can a 442-dollar gift card be spent in full? : False


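To make this faster, we can store all laptop prices in a set when we initialize the inventory. Then we can check in constant time whether there is a laptop with a given price. For the two-laptop case, we loop over each price and check whether the remaining amount is also in the set, which takes O(N) time instead of the O(N^2) of the nested loops.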
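The set-based check can be sketched on a handful of made-up prices. The `toy_prices` set and the `can_spend_exactly` function are illustrative assumptions that mirror the idea, not the class method itself.

```python
# Made-up prices; a set gives average constant-time membership checks.
toy_prices = {300, 450, 700}

def can_spend_exactly(dollars, prices):
    # One laptop costs exactly the gift card amount.
    if dollars in prices:
        return True
    # Two laptops: for each price, look for the complementary price.
    for price in prices:
        if dollars - price in prices:
            return True
    return False

print(can_spend_exactly(700, toy_prices))  # True  (one laptop at 700)
print(can_spend_exactly(750, toy_prices))  # True  (300 + 450)
print(can_spend_exactly(600, toy_prices))  # True  (300 + 300, the same price used twice)
print(can_spend_exactly(500, toy_prices))  # False
```

Note that a price can be paired with itself. This matches the nested loops in check_promotion_dollars(), where row1 and row2 may be the same row.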

In [13]:
class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
            
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
            
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
        self.prices = set()
        for row in self.rows :
            price = row[-1]
            self.prices.add(price)
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None  
    
    def check_promotion_dollars(self, dollars) : 
        for row in self.rows : 
            if dollars == row[-1] : 
                return True 
        for row1 in self.rows : 
            for row2 in self.rows : 
                if row1[-1] + row2[-1] == dollars :
                    return True 
        return False 
    
    def check_promotion_dollars_fast(self, dollars) : 
        if dollars in self.prices : 
            return True
        for price in self.prices :
            left_dollars = dollars - price
            if left_dollars in self.prices : 
                return True
        return False 

In [14]:
inventory = Inventory('laptops.csv')
print(f"Can a 1000-dollar gift card be spent in full? : {inventory.check_promotion_dollars_fast(1000)}")
print(f"Can a 442-dollar gift card be spent in full? : {inventory.check_promotion_dollars_fast(442)}")

Can a 1000-dollar gift card be spent in full? : True
Can a 442-dollar gift card be spent in full? : False


## Comparing the execution time of the two promotion methods

In [15]:
import time
import random 

prices = [random.randint(100, 5000) for _ in range(100)]

class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
            
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
            
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
        self.prices = set()
        for row in self.rows :
            price = row[-1]
            self.prices.add(price)
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None  
    
    def check_promotion_dollars(self, dollars) : 
        for row in self.rows : 
            if dollars == row[-1] : 
                return True 
        for row1 in self.rows : 
            for row2 in self.rows : 
                if row1[-1] + row2[-1] == dollars :
                    return True 
        return False 
    
    def check_promotion_dollars_fast(self, dollars) : 
        if dollars in self.prices : 
            return True
        for price in self.prices :
            left_dollars = dollars - price
            if left_dollars in self.prices : 
                return True
        return False 

In [16]:
inventory = Inventory("laptops.csv")
total_time_no_set = 0    # This variable will aggregate the times of calling check_promotion_dollars()

for price in prices : 
    start = time.time()
    inventory.check_promotion_dollars(price)
    end = time.time()
    total_time_no_set += end-start 
    
total_time_set = 0    # This variable will aggregate the times of calling check_promotion_dollars_fast()

for price in prices : 
    start = time.time()
    inventory.check_promotion_dollars_fast(price)
    end = time.time()
    total_time_set += end-start

In [17]:
print(f"Total time of check_promotion_dollars() : {total_time_no_set}")
print(f"Total time of check_promotion_dollars_fast() : {total_time_set}")

Total time of check_promotion_dollars() : 1.9592435359954834
Total time of check_promotion_dollars_fast() : 0.0008885860443115234


In this run, check_promotion_dollars_fast() is about 2,200 times faster than check_promotion_dollars() (roughly 0.00089 seconds versus 1.96 seconds for 100 queries).

## Finding Laptops within a Budget

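We want to write a method that efficiently answers the query: given a budget of D dollars, find all laptops whose price is at most D. To do this, we keep the rows sorted by price in self.rows_by_price and use binary search to find the index of the first laptop that costs more than D. Every row before that index is within budget.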

In [18]:
def row_price(row):
    return row[-1]

class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
            
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
            
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
        self.prices = set()
        for row in self.rows :
            price = row[-1]
            self.prices.add(price)
            
        self.rows_by_price = sorted(self.rows, key = row_price)
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None  
    
    def check_promotion_dollars(self, dollars) : 
        for row in self.rows : 
            if dollars == row[-1] : 
                return True 
        for row1 in self.rows : 
            for row2 in self.rows : 
                if row1[-1] + row2[-1] == dollars :
                    return True 
        return False 
    
    def check_promotion_dollars_fast(self, dollars) : 
        if dollars in self.prices : 
            return True
        for price in self.prices :
            left_dollars = dollars - price
            if left_dollars in self.prices : 
                return True
        return False 
    
    def find_laptop_with_price(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1                       
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            if price == target_price:                            
                return range_middle                        
            elif price < target_price:                           
                range_start = range_middle + 1             
            else:                                          
                range_end = range_middle - 1 
        price = self.rows_by_price[range_start][-1]
        if price != target_price:                  
            return -1                                      
        return range_start
    
    def find_first_laptop_more_expensive(self, target_price) : 
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end : 
            range_middle = (range_end + range_start) // 2
            price = self.rows_by_price[range_middle][-1]    # Price of the row in the middle of the current range
            if price > target_price : 
                range_end = range_middle
            else : 
                range_start = range_middle + 1
        price = self.rows_by_price[range_start][-1]
        if price <= target_price : 
            return -1 
        return range_start 

In [19]:
inventory = Inventory('laptops.csv')
print(inventory.find_first_laptop_more_expensive(1000))
print(inventory.find_first_laptop_more_expensive(10000))

683
-1


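The first laptop that costs more than 1,000 dollars is at index 683, so 683 laptops (indices 0 to 682) fit within a 1,000-dollar budget. The result -1 for 10,000 dollars means that no laptop costs more than that, so every laptop fits within a 10,000-dollar budget.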
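Python's standard library offers the same kind of search in the `bisect` module. As a hedged sketch on made-up prices (the `toy_prices` list and the `count_within_budget` helper are assumptions for illustration), `bisect_right` returns the index of the first element greater than the budget, which is also the number of elements within budget.

```python
from bisect import bisect_right

# Made-up prices, already sorted in ascending order.
toy_prices = [300, 450, 450, 700, 1200]

def count_within_budget(prices, budget):
    # bisect_right returns the index of the first element greater than budget,
    # which equals the number of elements less than or equal to budget.
    return bisect_right(prices, budget)

print(count_within_budget(toy_prices, 450))   # 3
print(count_within_budget(toy_prices, 5000))  # 5
print(count_within_budget(toy_prices, 100))   # 0
```

Unlike find_first_laptop_more_expensive(), which returns -1 when no laptop is more expensive, `bisect_right` returns the length of the list in that case, so the result can be used directly as a count or as a slice bound.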

## Finding Laptops within a Budget Range

In [51]:
def row_price(row):
    return row[-1]

class Inventory() : 
    
    def __init__(self, csv_filename) : 
        with open(csv_filename, encoding = "UTF-8") as file : 
            file = list(csv.reader(file))
            header = file[0]
            rows = file[1:]
            
        self.header = header
        self.rows = rows 
        for row in self.rows : 
            row[-1] = int(row[-1])
            
        self.id_to_row = {}
        for row in self.rows : 
            row_id = row[0]
            row_value = row[1:]
            self.id_to_row[row_id] = row_value
            
        self.prices = set()
        for row in self.rows :
            price = row[-1]
            self.prices.add(price)
            
        self.rows_by_price = sorted(self.rows, key = row_price)
            
    def get_laptop_from_id(self, laptop_id) : 
        for row in self.rows : 
            if laptop_id in row : 
                return row 
        return None
            
    def get_laptop_from_id_fast(self, laptop_id) : 
        if laptop_id in self.id_to_row : 
            return self.id_to_row[laptop_id]
        return None  
    
    def check_promotion_dollars(self, dollars) : 
        for row in self.rows : 
            if dollars == row[-1] : 
                return True 
        for row1 in self.rows : 
            for row2 in self.rows : 
                if row1[-1] + row2[-1] == dollars :
                    return True 
        return False 
    
    def check_promotion_dollars_fast(self, dollars) : 
        if dollars in self.prices : 
            return True
        for price in self.prices :
            left_dollars = dollars - price
            if left_dollars in self.prices : 
                return True
        return False 
    
    def find_laptop_with_price(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1                       
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            if price == target_price:                            
                return range_middle                        
            elif price < target_price:                           
                range_start = range_middle + 1             
            else:                                          
                range_end = range_middle - 1 
        price = self.rows_by_price[range_start][-1]
        if price != target_price:                  
            return -1                                      
        return range_start
    
    def find_first_laptop_more_expensive(self, target_price) : 
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end : 
            range_middle = (range_end + range_start) // 2
            price = self.rows_by_price[range_middle][-1]    # Price of the row in the middle of the current range
            if price > target_price : 
                range_end = range_middle
            else : 
                range_start = range_middle + 1
        price = self.rows_by_price[range_start][-1]
        if price <= target_price : 
            return -1 
        return range_start 
    
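    def find_last_laptop_less_expensive(self, target_price) : 
        # Binary search for the last index whose price is at most target_price
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end : 
            # Round up so the range always shrinks when range_start moves to range_middle
            range_middle = (range_end + range_start + 1) // 2
            price = self.rows_by_price[range_middle][-1]
            if price <= target_price :
                range_start = range_middle
            else : 
                range_end = range_middle - 1
        price = self.rows_by_price[range_start][-1]
        if price > target_price : 
            return -1    # Every laptop costs more than target_price
        return range_start
    
    def find_laptops_between_range(self, min_price, max_price) : 
        # Count the laptops with min_price < price <= max_price
        max_index = self.find_last_laptop_less_expensive(max_price) 
        min_index = self.find_first_laptop_more_expensive(min_price) 
        if max_index == -1 or min_index == -1 or min_index > max_index : 
            return 0    # No laptop falls in the range
        return max_index - min_index + 1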

In [52]:
# Check find_last_laptop_less_expensive against find_first_laptop_more_expensive

inventory = Inventory('laptops.csv')
print(inventory.find_first_laptop_more_expensive(1000))
print(inventory.find_last_laptop_less_expensive(1000))

683
682


In [53]:
inventory = Inventory('laptops.csv')
print(inventory.find_laptops_between_range(1000, 10000))

620


There are 620 laptops that cost more than 1,000 dollars and at most 10,000 dollars.
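find_laptops_between_range() returns a count. When the rows themselves are needed, the same two boundaries can be used to slice the sorted list. The sketch below does this with `bisect_right` on made-up data; `toy_rows_by_price`, `toy_prices`, and `rows_in_price_range` are hypothetical names, and the range is taken as min_price < price <= max_price to match the method above.

```python
from bisect import bisect_right

# Made-up rows (name, price), sorted by price as rows_by_price is in the notebook.
toy_rows_by_price = [["a", 300], ["b", 450], ["c", 450], ["d", 700], ["e", 1200]]
toy_prices = [row[-1] for row in toy_rows_by_price]

def rows_in_price_range(min_price, max_price):
    # First index whose price is strictly greater than min_price.
    low = bisect_right(toy_prices, min_price)
    # One past the last index whose price is at most max_price.
    high = bisect_right(toy_prices, max_price)
    return toy_rows_by_price[low:high]

print(rows_in_price_range(450, 1200))   # [['d', 700], ['e', 1200]]
print(rows_in_price_range(300, 450))    # [['b', 450], ['c', 450]]
print(rows_in_price_range(2000, 3000))  # []
```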