# Building fast queries on a CSV

In this project, We will use the *laptops.csv* file as our inventory. This CSV file was adapted from the [Laptop Prices dataset on Kaggle](https://www.kaggle.com/ionaskel/laptop-prices). The goal is to create a class that represents our inventory. The methods in that class will implement the queries that we want to answer about our inventory. We will also preprocess that data to make those queries run faster.

Here are some queries that we will want to answer:

- Given a laptop id, find the corresponding data.
- Given an amount of money, find whether there are two laptops whose total price is that given amount.
- Identify all laptops whose price falls within a given budget.

To achieve this purpose, we are going to create a new class, **Inventory**.

In [1]:
import csv
import chardet
import re

First of all, let's take a look at the data we have.

In [2]:
# # check files' encoding
# with open('laptops.csv', mode='rb') as file:
#     raw_bytes = file.read(32)
#     print(chardet(raw_bytes))
#     encodingname = chardet(raw_bytes)['encoding']    

In [3]:
with open('laptops.csv', encoding = 'utf8') as file:
    reader = list(csv.reader(file))
    header = reader[0]
    rows = reader[1:]

In [4]:
print('header\n', header, '\n')
print('rows\n', rows[:5])

header
 ['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price'] 

rows
 [['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339'], ['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898'], ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575'], ['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537'], ['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz'

## Create Inventory class and read file

First of all, we will create a new class to read the file and change the `price` column into **int**.

In [5]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
            
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])

In [6]:
inventory = Inventory('laptops.csv')
print(inventory.header)
print(len(inventory.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
1303


In [7]:
print(inventory.rows[2])

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]


## Find laptop details from ID

Then, we will write a function named *get_laptop_from_id()* to look up a laptop from a given identifier. In this way, when a customer comes to our store with a purchase slip, we can quickly identify the laptop to which it corresponds. This function will take as argument the identifier of the laptop and return the full row of the laptop with that id.

In [8]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
            
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'             

In [9]:
inventory = Inventory('laptops.csv')
print(inventory.get_laptop_from_id('3362737'))
print(inventory.get_laptop_from_id('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None


This algorithm has time complexity O(R) where R is the number of rows.

## Improve complexity in finding with ID

We would like to reduce the complxity by proceprocessing the data into a dictionary where the keys are the IDs and the values the rows. Then, we will use that dictionary in the get_laptop_from_id() method. We can do this by:

In [10]:
class Inventory():
    
    def __init__(self, csv_filename):
        
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
           
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # create a dictionary with ID as the key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'           
    
    def get_laptop_from_id_fast(self, laptop_id):
        # laptop_id is the first column
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return 'None'           

In [11]:
inventory = Inventory('laptops.csv')
print(inventory.get_laptop_from_id_fast('3362737'))
print(inventory.get_laptop_from_id_fast('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None


This algorithm has time complexity O(1).

## Compare performance

We would time to compare the performance of the functions *get_laptop_from_id* and *get_laptop_from_id_fast* with 10,000 random value between '1000000' and '9999999'.

In [12]:
import time
import random

In [13]:
# create random IDs
rand_ids = str([random.randint(1000000,9999999) for _ in range(10000)])

# initiate data
inventory = Inventory('laptops.csv')

# times of calling get_laptop_from_id
total_time_no_dict = 0

for id in rand_ids:
    start = time.time()
    inventory.get_laptop_from_id(id)
    end = time.time()
    total_time_no_dict += end-start

# time of calling get_laptop_from_id_fast
total_time_dict = 0

for id in rand_ids:
    start = time.time()
    inventory.get_laptop_from_id_fast(id)
    end = time.time()
    total_time_dict += end-start

In [14]:
print(
    '''
    total_time_no_dict: ', {}
    total_time_dict: ', {}
    '''.format(total_time_no_dict, total_time_dict)
)


    total_time_no_dict: ', 9.679114818572998
    total_time_dict: ', 0.04151296615600586
    


*get_laptop_from_id_fast* is more than 200 times faster *get_laptop_from_id*!

## Promotion check

In this part, we will write a function with a given dollar amount to check whether it is possible to spend precisely that amount by purchasing up to two laptops.

In [15]:
class Inventory():
    
    def __init__(self, csv_filename):
        
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
           
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # create a dictionary with ID as the key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'           
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return 'None'           
    
    def check_promotion_dollars(self, dollars):
        # match with single item
        for row in self.rows:
            if dollars == row[-1]:
                return True

        # match with pairs
        for row1 in self.rows:
            for row2 in self.rows:
                if dollars == row1[-1]+row2[-1]:
                    return True

        # no matches 
        return False

In [16]:
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars(1000))
print(inventory.check_promotion_dollars(442))

True
False


## Improve complexity in promotion check

In [17]:
class Inventory():
    
    def __init__(self, csv_filename):
        
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
           
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # create a dictionary with ID as the key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
        
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'           
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return 'None'           
    
    def check_promotion_dollars(self, dollars):
        # match with single item
        for row in self.rows:
            if dollars == row[-1]:
                return True

        # match with pairs
        for row1 in self.rows:
            for row2 in self.rows:
                if dollars == row1[-1]+row2[-1]:
                    return True

        # no matches 
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False

In [18]:
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars_fast(1000))
print(inventory.check_promotion_dollars_fast(442))

True
False


## Compare performance

We would time to compare the performance of the two functions with 100 random prices between 100 and 5000.

In [19]:
# create random price
rand_prices = [random.randint(100,5000) for _ in range(100)]

# initiate data
inventory = Inventory('laptops.csv')

# times of calling get_laptop_from_id
total_time_no_set = 0

for price in rand_prices:
    start = time.time()
    inventory.check_promotion_dollars(price)
    end = time.time()
    total_time_no_set += end-start

# time of calling get_laptop_from_id_fast
total_time_set = 0

for price in rand_prices:
    start = time.time()
    inventory.check_promotion_dollars_fast(price)
    end = time.time()
    total_time_set += end-start

In [20]:
print(
    '''
    total_time_no_set: ', {}
    total_time_set: ', {}
    '''.format(total_time_no_set, total_time_set)
)


    total_time_no_set: ', 3.2755239009857178
    total_time_set: ', 0.0011219978332519531
    


In [21]:
1500

1500

The method with set is more than 1500 times faster the method without using set!

## Find laptops within a budget

We allow the user to input a budget and return the the first row from a sorted table whose price is larger than the budget.

In [22]:
class Inventory():
    
    def __init__(self, csv_filename):
        
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
           
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # dictionary with ID as the key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
        
        # set of price
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
        
        # sorted data by price
        self.rows_by_price = sorted(self.rows, 
                                   key = lambda row: row[-1])

            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'           
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return 'None'           
    
    def check_promotion_dollars(self, dollars):
        # match with single item
        for row in self.rows:
            if dollars == row[-1]:
                return True

        # match with pairs
        for row1 in self.rows:
            for row2 in self.rows:
                if dollars == row1[-1]+row2[-1]:
                    return True

        # no matches 
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False
    
    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1    
        if target_price > self.rows_by_price[-1][-1]:
            return -1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            if price <= target_price:
                range_start = range_middle + 1             
            else:                                          
                range_end = range_middle 
        price = self.rows_by_price[range_start][-1]                             
        return range_end, self.rows_by_price[range_end]

In [23]:
inventory = Inventory('laptops.csv')
print(inventory.find_first_laptop_more_expensive(1000))
print(inventory.find_first_laptop_more_expensive(100000))

(683, ['8747948', 'Lenovo', 'ThinkPad T460', 'Notebook', '14', '1366x768', 'Intel Core i5 6200U 2.3GHz', '4GB', '508GB Hybrid', 'Intel HD Graphics 520', 'Windows 7', '1.70kg', 1002])
-1


## Find laptops within a budget in a range

Now we extend our budget query to take as input a range of prices, min_price and max_price, rather than a single price.

In [24]:
class Inventory():
    
    def __init__(self, csv_filename):
        
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
           
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # dictionary with ID as the key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
        
        # set of price
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
        
        # sorted data by price
        self.rows_by_price = sorted(self.rows, 
                                   key = lambda row: row[-1])

            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'           
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return 'None'           
    
    def check_promotion_dollars(self, dollars):
        # match with single item
        for row in self.rows:
            if dollars == row[-1]:
                return True

        # match with pairs
        for row1 in self.rows:
            for row2 in self.rows:
                if dollars == row1[-1]+row2[-1]:
                    return True

        # no matches 
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False
    
    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1    
        if target_price > self.rows_by_price[-1][-1]:
            return -1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            if price <= target_price:
                range_start = range_middle + 1             
            else:                                          
                range_end = range_middle 
        price = self.rows_by_price[range_start][-1]                             
        return self.rows_by_price[range_end]
    
    def find_laptop_in_range(self, min_price, max_price):
        print('Model with price between {} and {}:\n'.format(min_price, max_price))
        if min_price > self.rows_by_price[-1][-1] or max_price < self.rows_by_price[0][-1] or max_price < min_price:
            return -1
        for row in self.rows_by_price:
            if row[-1] > max_price:
                return '-End of serach-'
            if row[-1] > min_price:
                print(row, '\n')

In [25]:
inventory = Inventory('laptops.csv')
print(inventory.find_laptop_in_range(1000,1010))

Model with price between 1000 and 1010:

['8747948', 'Lenovo', 'ThinkPad T460', 'Notebook', '14', '1366x768', 'Intel Core i5 6200U 2.3GHz', '4GB', '508GB Hybrid', 'Intel HD Graphics 520', 'Windows 7', '1.70kg', 1002] 

['5550925', 'Dell', 'Latitude 5580', 'Notebook', '15.6', '1366x768', 'Intel Core i5 7300U 2.6GHz', '8GB', '500GB HDD', 'Intel HD Graphics 620', 'Windows 10', '1.9kg', 1008] 

['3667708', 'Acer', 'Aspire F5-573G-510L', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '12GB', '128GB SSD +  1TB HDD', 'Nvidia GeForce GTX 950M', 'Windows 10', '2.4kg', 1009] 

['8017281', 'Dell', 'Vostro 5568', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i7 7500U 2.7GHz', '8GB', '256GB SSD', 'Nvidia GeForce 940MX', 'Windows 10', '2.18kg', 1009] 

['6766298', 'Lenovo', 'Thinkpad 13', 'Notebook', '13.3', 'IPS Panel Full HD 1920x1080', 'Intel Core i7 7500U 2.7GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'Windows 10', '1.4kg', 1010] 

['9303831', 'HP', 'ProBook

## Find the cheapest laptop with specific RAM and capacity

Sometimes, a customer wants a laptop with some characteristics such as, for instance, 8GB or RAM and a 256GB hard drive. It would be interesting for those customers to provide a way to find the cheapest laptop that matches the desired characteristics. Un this case, we will focus only on the amount of RAM and hard drive capacity. 

In [73]:
class Inventory():
    
    def __init__(self, csv_filename):
        
        with open('laptops.csv', encoding = 'utf8') as file:            
            reader = list(csv.reader(file))
           
        self.header = reader[0]
        self.rows = reader[1:]
        
        # convert price into int
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # dictionary with ID as the key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
        
        # set of price
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
        
        # sorted data by price
        self.rows_by_price = sorted(self.rows, key = lambda row: row[-1])
        
        # convert RAM into int
        self.rows_by_price_int_spec = self.rows_by_price
        for row in self.rows_by_price_int_spec:
            ram_total = 0
            GB = re.findall(r'\d+(?=GB)', str(row[7]))
            TB = re.findall(r'\d+(?=TB)', str(row[7]))
            for ele in GB:
                ram_total += int(ele)
            for ele in TB:
                ram_total += int(ele)*1000
            row.append(ram_total)
        
        # convert Memory into int
        for row in self.rows_by_price_int_spec:
            memory_total = 0
            GB = re.findall(r'\d+(?=GB)', str(row[8]))
            TB = re.findall(r'\d+(?=TB)', str(row[8]))
            for ele in GB:
                memory_total += int(ele)
            for ele in TB:
                memory_total += int(ele)*1000
            row.append(memory_total)
            
        # set of ram
        self.ram = set()
        for row in self.rows_by_price_int_spec:
            self.ram.add(row[-2])
            
        # set of memory
        self.memory = set()
        for row in self.rows_by_price_int_spec:
            self.memory.add(row[-1])


            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            # laptop_id is the first column
            if row[0] == laptop_id:
                return row 
        return 'None'           
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return 'None'           
    
    def check_promotion_dollars(self, dollars):
        # match with single item
        for row in self.rows:
            if dollars == row[-1]:
                return True

        # match with pairs
        for row1 in self.rows:
            for row2 in self.rows:
                if dollars == row1[-1]+row2[-1]:
                    return True

        # no matches 
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False
    
    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1    
        if target_price > self.rows_by_price[-1][-1]:
            return -1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            if price <= target_price:
                range_start = range_middle + 1             
            else:                                          
                range_end = range_middle 
        price = self.rows_by_price[range_start][-1]                             
        return self.rows_by_price[range_end]
    
    def find_laptop_in_range(self, min_price, max_price):
        print('Model with price between {} and {}:\n'.format(min_price, max_price))
        if min_price > self.rows_by_price[-1][-1] or max_price < self.rows_by_price[0][-1] or max_price < min_price:
            return -1
        for row in self.rows_by_price:
            if row[-1] > max_price:
                return '-End of serach-'
            if row[-1] > min_price:
                print(row, '\n')

    # ram & momery are in GB
    def find_cheapest_laptop_with_spec(self, ram, memory):
        
        # serach for the closest value 
        if ram in self.ram:
            target_ram = int(ram)
        else:
            for match_ram in sorted(list(self.ram)):
                if int(match_ram) < int(ram):
                    target_ram = int(match_ram)
            print('{}GB is not avaliable. The best option will be {}GB.\n'.format(ram, target_ram))
                    
        if memory in self.memory:
            target_memory = memory
        else:
            for match_memory in sorted(list(self.memory)):
                if int(match_memory) < int(memory):
                    target_memory = int(match_memory)
            print('{}GB is not avaliable. The best option will be {}GB.\n'.format(memory, target_memory))
            
        for row in self.rows_by_price_int_spec:
            if (row[-2] == target_ram) & (row[-1] == target_memory):
                return row
        return -1

In [74]:
inventory = Inventory('laptops.csv')
print(inventory.find_cheapest_laptop_with_spec(128,1000))
# print(inventory.ram)

128GB is not avaliable. The best option will be 64GB.

['3335869', 'Asus', 'ROG G701VO', 'Gaming', '17.3', 'IPS Panel Full HD 1920x1080', 'Intel Core i7 6820HK 2.7GHz', '64GB', '1TB SSD', 'Nvidia GeForce GTX 980 ', 'Windows 10', '3.58kg', 3975, 64, 1000]
