### Project: Building Fast Queries on CSV's

---

__The aim is to create a class that represents the inventory.__ 

The methods in that class will implement the queries that we want to answer about our inventory. We will also preprocess that data to make those queries run faster.

---

__Through this project we will look to answer:__

- Given a laptop id, find the corresponding data.

- Given an amount of money, find whether there are two laptops whose total price is that given amount.

- Identify all laptops whose price falls within a given budget.

---

__We will use the laptops.csv file as our inventory. This CSV file was adapted from the Laptop Prices dataset on Kaggle.__
 
Data columns:

- ID: A unique identifier for the laptop.
- Company: The name of the company that produces the laptop.
- Product: The name of the laptop.
- TypeName: The type of laptop.
- Inches: The size of the screen in inches.
- ScreenResolution: The resolution of the screen.
- CPU: The laptop CPU.
- RAM: The amount of RAM in the laptop.
- Memory: The size of the hard drive.
- GPU: The graphics card name.
- OpSys: The name of the operating system.
- Weight: The laptop weight.
- Price: The price of the laptop.


In [1]:
import csv
with open('laptops.csv') as file:
    master = list(csv.reader(file))
    header = master[0]
    rows = master[1:]

In [2]:
print(header)

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


In [3]:
[print(i, '\n') for i in rows[:5]]

['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339'] 

['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898'] 

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575'] 

['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537'] 

['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256GB SSD', 'Intel Iris Plus Graphics 650', 'macOS', '1.37kg', '1803'] 



[None, None, None, None, None]

In [4]:
# 1st Inventory class
# Implement the class constructor
# Takes name of the CSV file as argument and reads rows

class Inventory():
    import csv
    def __init__(self, csv_filename, encoding = 'utf8'):
        
        # Open file as list of lists
        with open(csv_filename) as file:
            master = list(csv.reader(file))
            self.header = master[0]
            self.rows = master[1:]
            
            # Price from str to int
            for row in self.rows:
                row[12] = int(row[12])

In [5]:
check = Inventory('laptops.csv')
print(check.header)
print('Leneth of check.rows:', len(check.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
Leneth of check.rows: 1303


In [6]:
# 2nd Inventory class 
# Add get_laptop_from_id() method
# Implement way to look up laptop from a given identifier
# If customer comes to store with receipt, can find laptop with identifier code

class Inventory():
    import csv
    
    def __init__(self, csv_filename, encoding = 'utf8'):
        
        # Open file as list of lists
        with open(csv_filename) as file:
            master = list(csv.reader(file))
            self.header = master[0]
            self.rows = master[1:]
            
            # Price from str to int
            for row in self.rows:
                row[12] = int(row[12])
    
    def get_laptop_from_id(self, laptop_id):
        if type(laptop_id) == int:
            laptop_id = str(laptop_id)
        for row in self.rows:
            if row[0] == laptop_id:
                return row

In [7]:
check_2 = Inventory('laptops.csv')
# Test on a match
print(check_2.get_laptop_from_id(3362737))
# Test on a non-match
print(check_2.get_laptop_from_id(3362736))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None


In [8]:
# 3rd Inventory class 
# Add data pre-processing to __init__()
# Improve ID lookup function time complexity from O(R) to O(1) by
# Pre-processing data from list to dictionary
# Dictionary is chosen over set as need to also retrieve remaining row info

class Inventory():
    import csv
    
    def __init__(self, csv_filename, encoding = 'utf8'):
        
        # Open file as list of lists
        with open(csv_filename) as file:
            master = list(csv.reader(file))
            self.header = master[0]
            self.rows = master[1:]
            
            # Price from str to int
            for row in self.rows:
                row[12] = int(row[12])
        
        #Preprocess data into dict with laptop id as key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row.update({row[0]: row[1:]})
    
    def get_laptop_from_id(self, laptop_id):
        if type(laptop_id) == int:
            laptop_id = str(laptop_id)
        for row in self.rows:
            if row[0] == laptop_id:
                return row
    
    def get_laptop_from_id_fast(self, laptop_id):
        if type(laptop_id) != str:
            laptop_id = str(laptop_id)
            
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]

In [9]:
check_3 = Inventory('laptops.csv')
print(check_3.get_laptop_from_id_fast(3362737))
print(check_3.get_laptop_from_id_fast('3362736'))

['HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None


In [10]:
# Measure execution times of: 
#   get_laptop_from_id and get_laptop_from_id_fast 
import time, random
ids = [str(random.randint(9999999, 10000000)) for _ in range(10001)]
check_4 = Inventory('laptops.csv')

total_time_no_dict = 0
for i in ids:
    start = time.time()
    check_4.get_laptop_from_id(i)
    end = time.time()
    run_t = end - start
    total_time_no_dict += run_t

total_time_dict = 0
for i in ids:
    start = time.time()
    check_4.get_laptop_from_id_fast(i)
    end = time.time()
    run_t = end - start
    total_time_dict += run_t
    
print('Time without id dictionary:', round(total_time_no_dict, 2))
print('Time with id dictionary:', round(total_time_dict, 2))
print('Times faster with dictionary:', round(total_time_no_dict / total_time_dict))

Time without id dictionary: 0.58
Time with id dictionary: 0.0
Times faster with dictionary: 583


In [11]:
# 4th Inventory class 
# Add check_promotion_dollars() method
# Method, given dollar amount, checks if can purchase up to two laptops

class Inventory():
    import csv
    
    def __init__(self, csv_filename, encoding = 'utf8'):
        
        # Open file as list of lists
        with open(csv_filename) as file:
            master = list(csv.reader(file))
            self.header = master[0]
            self.rows = master[1:]
            
            # Price from str to int
            for row in self.rows:
                row[12] = int(row[12])
        
        #Preprocess data into dict with laptop id as key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row.update({row[0]: row[1:]})
    
    # Search every datapoint for id match
    def get_laptop_from_id(self, laptop_id):
        if type(laptop_id) == int:
            laptop_id = str(laptop_id)
        for row in self.rows:
            if row[0] == laptop_id:
                return row
    
    # Search self.id_to_row dictionary keys
    def get_laptop_from_id_fast(self, laptop_id):
        if type(laptop_id) != str:
            laptop_id = str(laptop_id)
            
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
    
    # Test for ability to purchase two laptops given dollar amount
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[12] == dollars:
                return True
            
        for row in self.rows:
            for next_row in self.rows:
                if (next_row[12] + row[12]) == dollars:
                    return True
        return False

In [12]:
check_5 = Inventory('laptops.csv')
print('Can $1000 buy 2 laptops:', check_5.check_promotion_dollars(1000))
print('Can $442 buy 2 laptops:', check_5.check_promotion_dollars(442))

Can $1000 buy 2 laptops: True
Can $442 buy 2 laptops: False


In [13]:
# 5th Inventory class 
# Add faster check_promotion_dollars method
class Inventory():
    import csv
    
    def __init__(self, csv_filename, encoding = 'utf8'):
        
        # Open file as list of lists
        with open(csv_filename) as file:
            master = list(csv.reader(file))
            self.header = master[0]
            self.rows = master[1:]
            
            # Price from str to int
            for row in self.rows:
                row[12] = int(row[12])
        
        # Preprocess data into dict with laptop id as key
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row.update({row[0]: row[1:]})
            
        # Preprocess price data into set
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[12])
    
    # Search every datapoint for id match
    def get_laptop_from_id(self, laptop_id):
        if type(laptop_id) == int:
            laptop_id = str(laptop_id)
        for row in self.rows:
            if row[0] == laptop_id:
                return row
    
    # Search self.id_to_row dictionary keys
    def get_laptop_from_id_fast(self, laptop_id):
        if type(laptop_id) != str:
            laptop_id = str(laptop_id)
            
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
    
    # Test for ability to purchase two laptops given dollar amount
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[12] == dollars:
                return True
        
        for row in self.rows:
            for next_row in self.rows:
                if (next_row[12] + row[12]) == dollars:
                    return True
        return False
    
    # Test for ability to purchase two laptops given dollar amount using price set
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        
        for price in self.prices:
            for next_price in self.prices:
                if (price + next_price) == dollars:
                    return True
        return False

In [14]:
check_6 = Inventory('laptops.csv')
print('Can $1000 buy 2 laptops:', check_6.check_promotion_dollars_fast(1000))
print('Can $442 buy 2 laptops:', check_6.check_promotion_dollars_fast(442))

Can $1000 buy 2 laptops: True
Can $442 buy 2 laptops: False


In [15]:
# Measure execution times of check_promotion_dollars and check_promotion_dollars_fast
import time, random
prices = [str(random.randint(100, 5000)) for _ in range(101)]
check_7 = Inventory('laptops.csv')

total_time_no_set = 0
for i in prices:
    start = time.time()
    check_7.check_promotion_dollars(i)
    end = time.time()
    run_t = end - start
    total_time_no_set += run_t

total_time_set = 0
for i in prices:
    start = time.time()
    check_7.check_promotion_dollars_fast(i)
    end = time.time()
    run_t = end - start
    total_time_set += run_t
    
print('Time without price set:', round(total_time_no_set, 2))
print('Time with price set:', round(total_time_set, 2))
print('Times faster with price set:', round(total_time_no_set / total_time_set)+0.5)

Time without price set: 22.92
Time with price set: 3.11
Times faster with price set: 7.5


In [16]:
# 6th Inventory class
# Help customer find all laptops given budget of D dollars (write method to answer query: Given budget, find all laptops)
# Sort all laptops by price to then use binary search to find first laptop in the sorted list with a price larger than D
def row_price(row):
    return row[-1]

class Inventory():                    
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f: 
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]        
        self.rows = rows[1:]
        for row in self.rows:              
            row[-1] = int(row[-1])
        self.id_to_row = {}                        
        for row in self.rows:                       
            self.id_to_row[row[0]] = row
        self.prices = set()                          
        for row in self.rows:                        
            self.prices.add(row[-1])
        self.rows_by_price = sorted(self.rows, key=row_price)
    
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:                 
            if row[0] == laptop_id:
                return row
        return None   
    
    def get_laptop_from_id_fast(self, laptop_id):  
        if laptop_id in self.id_to_row:           
            return self.id_to_row[laptop_id]
        return None

    def check_promotion_dollars(self, dollars):    
        for row in self.rows:                   
            if row[-1] == dollars:
                return True
        for row1 in self.rows:                  
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:
                    return True
        return False                        
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:                   
            return True
        for price in self.prices:                    
            if dollars - price in self.prices:
                return True
        return False                                
    
    def find_laptop_with_price(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1                       
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            value = self.rows_by_price[range_middle][-1]
            if value == target_price:                            
                return range_middle                        
            elif value < target_price:                           
                range_start = range_middle + 1             
            else:                                          
                range_end = range_middle - 1 
        if self.rows_by_price[range_start][-1] != target_price:                  
            return -1                                      
        return range_start
    
    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1                   

        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            
            if price > target_price:
                range_end = range_middle
            else:
                range_start = range_middle + 1
        
        if self.rows_by_price[range_start][-1] <= target_price:                  
            return -1                                   
        return range_start

In [17]:
check_8 = Inventory('laptops.csv')
print('First laptop costing > $1000:', rows[check_8.find_first_laptop_more_expensive(1000)])
print('\n')
print('First laptop costing > $683:', rows[check_8.find_first_laptop_more_expensive(683)])

First laptop costing > $1000: ['4910469', 'HP', '17-bs000nv I3', 'Notebook', '17.3', 'IPS Panel Full HD 1920x1080', 'Intel Core i3 6006U 2GHz', '4GB', '256GB SSD', 'AMD Radeon R5 520', 'Windows 10', '2.5kg', '699']


First laptop costing > $683: ['9548081', 'Dell', 'Precision M5520', 'Workstation', '15.6', '4K Ultra HD / Touchscreen 3840x2160', 'Intel Core i7 7700HQ 2.8GHz', '8GB', '256GB SSD', 'Nvidia Quadro M1200', 'Windows 10', '1.78kg', '2712']


__Conclusions__

In this project, three functionalities were created:

- Looking up laptops by their ID number
- Seeing how many laptops a customer could afford given their budget
- Determining the highest priced laptop at which the customer could not afford; determining all of the laptops within the customer's budget

The three built-in Python modules used were: csv, time, and random

In this project, data pre-processing was used to significantly improve the time efficiency of the methods.