# Building Fast Queries on a CSV

In this project, we will look at a laptop dataset containing about 1300 models to answer a few business questions about our inventory. Our goal is to create a class that represents our inventory and uses methods to implement queries.

We will answer the following questions:
- Given a laptop Id, find the corresponding data.
- Given an amount of money, find whether there are two laptops whose total price is that given amount.
- Identify all laptops whose price falls within a given budget.

## Loading Laptop Dataset

In [1]:
import csv
with open('laptops.csv') as f:
    data = list(csv.reader(f))
    header = data[0]
    rows = data[1:]

    print(header, '\n')
    print(rows[:5])

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price'] 

[['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339'], ['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898'], ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575'], ['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537'], ['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256G

## Creating the Class

In [2]:
class Inventory():
    def __init__(self, csv_filename):  # Implement the constructor

        with open(csv_filename) as f:
            read = list(csv.reader(f))
        self.header = read[0]
        self.rows = read[1:]
        for row in self.rows:
            row[-1] = int(row[-1])  # Converts price values to int


data = Inventory('laptops.csv')  # Instantiate Inventory class
print(data.header, '\n')
print(len(data.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price'] 

1303


## Finding a Laptop From its Id

In [3]:
class Inventory():
    def __init__(self, csv_filename):

        with open(csv_filename) as f:
            read = list(csv.reader(f))
        self.header = read[0]
        self.rows = read[1:]
        for row in self.rows:
            row[-1] = int(row[-1])

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None


d = Inventory('laptops.csv')
# Test the new method with an existing Id and a missing Id
print(d.get_laptop_from_id('3362737'), '\n')
print(d.get_laptop_from_id('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575] 

None


## Improving Id Lookups

In [4]:
class Inventory():
    def __init__(self, csv_filename):

        with open(csv_filename) as f:
            read = list(csv.reader(f))
        self.header = read[0]
        self.rows = read[1:]
        for row in self.rows:
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row  # Map each Id directly to its row

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None

    # Faster lookup: the dictionary maps each Id to its row
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None


d = Inventory('laptops.csv')
print(d.get_laptop_from_id_fast('3362737'), '\n')
print(d.get_laptop_from_id_fast('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575] 

None


## Comparing the Performance of Both Methods

Our first method, `get_laptop_from_id()`, has a time complexity of *O(n)*, where *n* is the number of rows, because in the worst case it scans every row. Our second method, `get_laptop_from_id_fast()`, runs in *O(1)* time on average: the constructor builds the `id_to_row` dictionary once, at a cost of *O(n)*, and each query is then a single dictionary lookup.
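To make the difference concrete, here is a minimal sketch on a small hand-made list of rows. The rows and the helper names `find_linear` and `build_index` are illustrative assumptions, not part of the dataset or the `Inventory` class.

```python
# Toy rows in the same shape as the dataset: Id first, price last.
# These values are made up for illustration.
rows = [
    ['1000001', 'BrandA', 500],
    ['1000002', 'BrandB', 750],
    ['1000003', 'BrandC', 1200],
]


def find_linear(rows, laptop_id):
    """Scan the rows until the Id matches: O(n) comparisons in the worst case."""
    for row in rows:
        if row[0] == laptop_id:
            return row
    return None


def build_index(rows):
    """One O(n) pass that maps each Id to its row."""
    return {row[0]: row for row in rows}


index = build_index(rows)

# dict.get returns None for a missing key, so no membership test is needed.
print(find_linear(rows, '1000003'))
print(index.get('1000003'))
print(index.get('9999999'))
```

Both lookups return the same row object; only the amount of work per query differs.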

In order to confirm that this is true, we will compare performance times between both methods.

In [5]:
import time
import random

ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]
d = Inventory('laptops.csv')

total_time_no_dict = 0
for identifier in ids:
    start = time.time()
    d.get_laptop_from_id(identifier)
    end = time.time()
    total_time_no_dict += end - start

total_time_dict = 0
for identifier in ids:
    start = time.time()
    d.get_laptop_from_id_fast(identifier)
    end = time.time()
    total_time_dict += end - start

print(total_time_no_dict, total_time_dict)
print(total_time_no_dict / total_time_dict)

1.180232286453247 0.003981351852416992
296.44008623270855


As we can see, the 10,000 lookups take about 1.1802 seconds in total with `get_laptop_from_id()` and about 0.0040 seconds with `get_laptop_from_id_fast()`, so the dictionary-based method is about 296 times faster. The absolute saving is small for a dataset of only 1303 rows, but the gap grows with the number of rows and would be substantial for a dataset containing millions of rows.
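As an alternative way to take the same kind of measurement, the standard library's `timeit` module can time a whole batch of calls at once. The sketch below uses synthetic rows rather than `laptops.csv`, and the function names `lookup_linear` and `lookup_dict` are assumptions made for illustration. Timings will differ from machine to machine.

```python
import timeit

# Synthetic stand-in for the dataset: 1303 rows with made-up Ids and prices.
rows = [[str(1000000 + i), 'Brand', 500 + i] for i in range(1303)]
id_to_row = {row[0]: row for row in rows}


def lookup_linear(laptop_id):
    for row in rows:
        if row[0] == laptop_id:
            return row
    return None


def lookup_dict(laptop_id):
    return id_to_row.get(laptop_id)


missing_id = '9999999'  # Not present, so the linear scan visits every row

# timeit calls each function `number` times and returns the total seconds.
slow_time = timeit.timeit(lambda: lookup_linear(missing_id), number=1000)
fast_time = timeit.timeit(lambda: lookup_dict(missing_id), number=1000)

print(slow_time, fast_time)
```

Timing the batch in one call, instead of reading the clock around every lookup, keeps the cost of the clock calls themselves out of the totals.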

## Two Laptop Promotion

In our imaginary store, we offer a promotional gift card that customers can use to buy up to two laptops. The gift card can only be used once, so we want to know whether a customer can spend exactly a given dollar amount on one or two laptops. For this, we will create a new method called `check_promotion_dollars()`.
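The rule can be illustrated on a handful of made-up prices. This sketch assumes the same model may be bought twice, and the function name `can_spend_exactly` is hypothetical.

```python
from itertools import combinations_with_replacement

# Made-up prices for illustration only.
prices = [300, 450, 700]


def can_spend_exactly(prices, dollars):
    """Return True if one price, or the sum of two prices, equals dollars."""
    if dollars in prices:
        return True
    # combinations_with_replacement also yields (p, p): the same model twice.
    return any(a + b == dollars
               for a, b in combinations_with_replacement(prices, 2))


print(can_spend_exactly(prices, 450))  # True: one laptop
print(can_spend_exactly(prices, 750))  # True: 300 + 450
print(can_spend_exactly(prices, 600))  # True: 300 + 300
print(can_spend_exactly(prices, 500))  # False: no combination matches
```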

In [6]:
class Inventory():
    def __init__(self, csv_filename):

        with open(csv_filename) as f:
            read = list(csv.reader(f))
        self.header = read[0]
        self.rows = read[1:]
        for row in self.rows:
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None

    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        for i in self.rows:
            for j in self.rows:
                if i[-1] + j[-1] == dollars:
                    return True
        return False


d = Inventory('laptops.csv')
print(d.check_promotion_dollars(1000))
print(d.check_promotion_dollars(442))

True
False


## Optimizing Laptop Promotion Method

In [7]:
class Inventory():
    def __init__(self, csv_filename):

        with open(csv_filename) as f:
            read = list(csv.reader(f))
        self.header = read[0]
        self.rows = read[1:]
        for row in self.rows:
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None

    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        for i in self.rows:
            for j in self.rows:
                if i[-1] + j[-1] == dollars:
                    return True
        return False

    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False


d = Inventory('laptops.csv')
print(d.check_promotion_dollars_fast(1000))
print(d.check_promotion_dollars_fast(442))

True
False


## Comparing Promotion Methods

In [8]:
prices = [random.randint(100, 5000) for _ in range(100)]
d = Inventory('laptops.csv')

total_time_no_set = 0
for price in prices:
    start = time.time()
    d.check_promotion_dollars(price)
    end = time.time()
    total_time_no_set += end - start

total_time_set = 0
for price in prices:
    start = time.time()
    d.check_promotion_dollars_fast(price)
    end = time.time()
    total_time_set += end - start

print(total_time_no_set, total_time_set)
print(total_time_no_set / total_time_set)

3.600724220275879 0.0016040802001953125
2244.728299643282


As we can see, the 100 queries take about 3.6007 seconds in total with `check_promotion_dollars()` and about 0.0016 seconds with `check_promotion_dollars_fast()`. As a result, the set-based method is about 2,245 times faster than the nested-loop method.
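One way to see where the gap comes from is to count the work each approach does when no match exists. The nested loops evaluate a sum for every ordered pair of rows, while the set-based loop performs one membership test per distinct price. The sketch below uses made-up prices, and the counting functions are illustrative assumptions rather than methods of the class.

```python
# Made-up prices; duplicates are included on purpose.
prices = [300, 450, 450, 700, 700, 700, 1200]
price_set = set(prices)


def count_pair_checks(prices, dollars):
    """Count the sums the nested loops evaluate before returning."""
    checks = 0
    for a in prices:
        for b in prices:
            checks += 1
            if a + b == dollars:
                return checks
    return checks


def count_set_checks(price_set, dollars):
    """Count the membership tests the set-based loop performs."""
    checks = 0
    for price in price_set:
        checks += 1
        if dollars - price in price_set:
            return checks
    return checks


unreachable = 1  # No single price or pair of prices adds up to 1
print(count_pair_checks(prices, unreachable))   # 49, i.e. 7 * 7 pairs
print(count_set_checks(price_set, unreachable))  # 4 distinct prices
```

With *n* rows, the first count grows as *n*² in the worst case, while the second grows at most linearly with the number of distinct prices.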

## Finding Laptops Within a Budget

In [9]:
def row_price(row):
    return row[-1]


class Inventory():
    def __init__(self, csv_filename):

        with open(csv_filename) as f:
            read = list(csv.reader(f))
        self.header = read[0]
        self.rows = read[1:]
        for row in self.rows:
            row[-1] = int(row[-1])
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
        self.rows_by_price = sorted(self.rows, key=row_price)

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None

    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        for i in self.rows:
            for j in self.rows:
                if i[-1] + j[-1] == dollars:
                    return True
        return False

    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False

    def find_laptop_with_price(self, target_price):
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2
            value = self.rows_by_price[range_middle][-1]
            if value == target_price:
                return range_middle
            elif value < target_price:
                range_start = range_middle + 1
            else:
                range_end = range_middle - 1
        if self.rows_by_price[range_start][-1] != target_price:
            return -1
        return range_start

    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2
            price = self.rows_by_price[range_middle][-1]
            if price > target_price:
                range_end = range_middle
            else:
                range_start = range_middle + 1
        if self.rows_by_price[range_start][-1] <= target_price:
            return -1
        return range_start

In [10]:
# Testing new method
d = Inventory('laptops.csv')
print(d.find_first_laptop_more_expensive(1000))
print(d.find_first_laptop_more_expensive(10000))  # Doesn't exist

683
-1
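The index of the first laptop that is more expensive than the budget marks where the price-sorted list should be cut: every row before it is affordable. As a minimal sketch of that idea, the standard library's `bisect.bisect_right` performs the same kind of binary search on a sorted list. The rows here are made up, and `laptops_within_budget` is a hypothetical helper, not a method of the class above.

```python
import bisect

# Made-up rows, already sorted by price (the last field).
rows_by_price = [
    ['1000001', 'BrandA', 400],
    ['1000002', 'BrandB', 650],
    ['1000003', 'BrandC', 650],
    ['1000004', 'BrandD', 1100],
]
sorted_prices = [row[-1] for row in rows_by_price]


def laptops_within_budget(budget):
    """Return every row whose price is at most budget."""
    # bisect_right gives the index of the first price greater than budget,
    # or len(sorted_prices) when every price fits the budget.
    cutoff = bisect.bisect_right(sorted_prices, budget)
    return rows_by_price[:cutoff]


print(laptops_within_budget(650))   # The three rows priced 400, 650, 650
print(laptops_within_budget(5000))  # All four rows
print(laptops_within_budget(100))   # []
```

Note one difference from `find_first_laptop_more_expensive()`: that method returns -1 when every laptop fits the budget, so its result needs a check before it is used as a slice bound, since `rows[:-1]` would drop the last row.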


## Conclusion

In this project, we implemented a class that runs queries on our laptop data. With these queries, we can find a laptop from its Id, check whether a gift card amount can be spent exactly on one or two laptops, and use binary search on the price-sorted rows to find where a given budget cuts off.

Additionally, we were able to analyze different time complexities to optimize our methods further and perform faster queries based on our data. We could improve these methods even further for more complex queries that we would like to run with our data in the future.