# Wrangle (Acquire and Prepare)

This notebook contains all steps taken in the data acquisition and preparation phases of the data science pipeline for the Superstore Time Series project. It relies on helper files, so to run the code blocks, make sure all of the helper files are in the same directory as this notebook.

---

## The Required Imports

Everything we need to run the code blocks in this notebook is imported below. To run them you will need numpy, pandas, matplotlib, seaborn, and scikit-learn (imported as `sklearn`) installed on your computer.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from _acquire import Acquire

---

## Data Acquisition

In this section we'll cover all the steps taken to acquire the superstore data.

### Reading the Data From the Database

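The superstore data is stored in a MySQL database hosted at data.codeup.com, split across several tables (orders, categories, customers, products, and regions). We'll need to write a SQL query that joins these tables to select the data.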

In [2]:
# We'll need the get_db_url function
from get_db_url import get_db_url
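The `get_db_url` helper is not shown in this notebook. As a rough sketch of what a helper like it might do, the function below builds a connection string that `pd.read_sql` can use through SQLAlchemy. The name `build_db_url`, its parameters, and the `mysql+pymysql` driver prefix are all assumptions made for illustration; the real helper takes only the database name and presumably reads credentials from somewhere else.

```python
from urllib.parse import quote_plus


def build_db_url(database, user, password, host='data.codeup.com'):
    '''
    Hypothetical stand-in for get_db_url: returns a SQLAlchemy-style
    connection string for a MySQL database. The driver (pymysql) is an
    assumption; the real helper may use a different one.
    '''
    # Escape the password so characters such as '@' don't break the URL.
    return f'mysql+pymysql://{user}:{quote_plus(password)}@{host}/{database}'


# Example with placeholder credentials (not real ones).
example_url = build_db_url('superstore_db', 'some_user', 'p@ss')
print(example_url)
```

Keeping the credentials outside the notebook, which the separate helper module suggests, means the notebook can be shared without exposing them.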

In [3]:
# Here we'll use an SQL query to select the superstore data from data.codeup.com

sql = '''
SELECT
    orders.*,
    Category,
    `Sub-Category`,
    `Customer Name`,
    `Product Name`,
    `Region Name`
FROM orders
JOIN categories USING (`Category ID`)
JOIN customers USING (`Customer ID`)
JOIN products USING (`Product ID`)
JOIN regions USING (`Region ID`);
'''

superstore = pd.read_sql(sql, get_db_url('superstore_db'))
superstore.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1734 entries, 0 to 1733
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Order ID       1734 non-null   object 
 1   Order Date     1734 non-null   object 
 2   Ship Date      1734 non-null   object 
 3   Ship Mode      1734 non-null   object 
 4   Customer ID    1734 non-null   object 
 5   Segment        1734 non-null   object 
 6   Country        1734 non-null   object 
 7   City           1734 non-null   object 
 8   State          1734 non-null   object 
 9   Postal Code    1734 non-null   float64
 10  Product ID     1734 non-null   object 
 11  Sales          1734 non-null   float64
 12  Quantity       1734 non-null   float64
 13  Discount       1734 non-null   float64
 14  Profit         1734 non-null   float64
 15  Category ID    1734 non-null   int64  
 16  Region ID      1734 non-null   int64  
 17  Category       1734 non-null   object 
 18  Sub-Cate
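The query above keeps every column from `orders` and pulls the descriptive columns in from the lookup tables with `JOIN ... USING`. The sketch below reproduces that shape on a tiny in-memory SQLite database so it can be run without database credentials. The tables, rows, and the reduced set of columns are made up for illustration and are not the real superstore schema.

```python
import sqlite3

import pandas as pd

# Toy in-memory database; the rows here are invented for illustration.
conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE orders (`Order ID` TEXT, `Category ID` INTEGER, `Region ID` INTEGER, Sales REAL);
CREATE TABLE categories (`Category ID` INTEGER, Category TEXT, `Sub-Category` TEXT);
CREATE TABLE regions (`Region ID` INTEGER, `Region Name` TEXT);
INSERT INTO orders VALUES ('A-1', 1, 1, 10.5), ('A-2', 2, 1, 20.0), ('A-3', 1, 2, 5.25);
INSERT INTO categories VALUES (1, 'Furniture', 'Chairs'), (2, 'Technology', 'Phones');
INSERT INTO regions VALUES (1, 'South'), (2, 'West');
''')

# Same pattern as the superstore query: all order columns plus the
# descriptive columns from each lookup table, joined on the shared key.
toy_sql = '''
SELECT
    orders.*,
    Category,
    `Sub-Category`,
    `Region Name`
FROM orders
JOIN categories USING (`Category ID`)
JOIN regions USING (`Region ID`);
'''

toy = pd.read_sql(toy_sql, conn)
conn.close()

print(toy.shape)
print(list(toy.columns))
```

Because these are inner joins, an order whose key has no match in a lookup table would be dropped, so comparing the row count of the result with the row count of `orders` is a cheap sanity check.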

### Make it Reproducible

Now that we know how to get the data, we need to make the acquisition code reproducible so that anyone with the helper files can rerun it and get the same dataframe.

In [4]:
# We can inherit from the Acquire class to get all the caching code. We only need to provide
# the code for reading from the database.

class AcquireSuperstore(Acquire):
    def __init__(self):
        self.file_name = 'superstore.csv'
        self.database_name = 'superstore_db'
        self.sql = '''
        SELECT
            orders.*,
            Category,
            `Sub-Category`,
            `Customer Name`,
            `Product Name`,
            `Region Name`
        FROM orders
        JOIN categories USING (`Category ID`)
        JOIN customers USING (`Customer ID`)
        JOIN products USING (`Product ID`)
        JOIN regions USING (`Region ID`);
        '''
        
    def read_from_source(self):
        return pd.read_sql(self.sql, get_db_url(self.database_name))

In [5]:
# Let's test it

superstore = AcquireSuperstore().get_data()
superstore.shape

Reading from source.
Cacheing data.


(1734, 22)
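The `Acquire` base class lives in the `_acquire` helper file and is not shown here. Judging from the messages printed above, it reads from the source and then writes a local cache. The sketch below shows one way such a class could work. `CachedAcquire`, `ToyAcquire`, the `use_cache` parameter, and the "Reading from cache." message are hypothetical and may not match the real helper.

```python
import os
import tempfile

import pandas as pd


class CachedAcquire:
    '''Hypothetical sketch of a caching base class; not the real Acquire.'''

    file_name = None  # subclasses set the path of the cache file

    def read_from_source(self):
        # Subclasses supply the slow read (for example, a database query).
        raise NotImplementedError

    def get_data(self, use_cache=True):
        # Use the local CSV when it exists, otherwise read and cache.
        if use_cache and os.path.exists(self.file_name):
            print('Reading from cache.')
            return pd.read_csv(self.file_name)

        print('Reading from source.')
        df = self.read_from_source()
        print('Caching data.')
        df.to_csv(self.file_name, index=False)
        return df


class ToyAcquire(CachedAcquire):
    '''Toy subclass that counts how often the source is read.'''

    def __init__(self, directory):
        self.file_name = os.path.join(directory, 'toy.csv')
        self.source_reads = 0

    def read_from_source(self):
        self.source_reads += 1
        return pd.DataFrame({'Sales': [10.5, 20.0]})


with tempfile.TemporaryDirectory() as tmp:
    toy_acquire = ToyAcquire(tmp)
    first = toy_acquire.get_data()   # reads from the source, writes toy.csv
    second = toy_acquire.get_data()  # served from toy.csv
    reads = toy_acquire.source_reads
```

One caveat of caching to CSV: column dtypes are inferred again when the file is read back, so anything converted before caching (dates, for example) would come back as plain strings unless it is parsed again.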

---

## Data Preparation