## Big Ideas
- Cache your data to speed up your data acquisition.
- Helper functions are your friends.

## Objectives
#### By the end of the acquire lesson and exercises, you will be able to...

- **read data into a pandas DataFrame using the following modules:**

`pydataset`

In [None]:
from pydataset import data
df = data('dataset_name')

`seaborn datasets`

In [None]:
import seaborn as sns
df = sns.load_dataset('dataset_name')

- **read data into a pandas DataFrame from the following sources:**
    - an Excel spreadsheet
    - a Google sheet
    - Codeup's MySQL database

In [None]:
pd.read_excel('file_name.xlsx', sheet_name='sheet_name')
pd.read_csv('filename.csv')
pd.read_sql(sql_query, connection_url)

- **use pandas methods and attributes to do some initial summarization and exploration of your data.**

In [None]:
.head()
.shape
.info()
.columns
.dtypes
.describe()
.value_counts()
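
As a quick illustration of the methods and attributes listed above, here is a minimal sketch on a small, made-up DataFrame. The column names and values are placeholders chosen for this example, not data from the lesson.

```python
import pandas as pd

# Made-up sample data, just to have something to summarize.
df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0],
    'survived': [0, 1, 1, 0],
})

print(df.head(2))                 # first two rows
print(df.shape)                   # (rows, columns) as a tuple
df.info()                         # prints column dtypes and non-null counts
print(df.columns.tolist())        # column names
print(df.dtypes)                  # dtype of each column
print(df.describe())              # summary statistics for numeric columns
print(df['sex'].value_counts())   # frequency of each unique value
```

Note that `.shape`, `.columns`, and `.dtypes` are attributes (no parentheses), while the rest are methods.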

- **create functions that acquire data from Codeup's database, save the data locally to CSV files (cache your data), and check for CSV files upon subsequent use.**
- **create a new python module, acquire.py, that holds your functions that acquire the titanic and iris data and can be imported and called in other notebooks and scripts.**

In [5]:
import pandas as pd
import numpy as np
import os

# visualize
import seaborn as sns
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(11, 9))
plt.rc('font', size=13)

# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

# acquire
from env import host, user, password
from pydataset import data

## From a Database
Create your DataFrame using a SQL query to access a database.

#Import private info to keep it secret in public files.

`from env import host, password, user`

#Test query in Sequel Pro and save to a variable.

`sql_query = 'write your sql query here; test it in Sequel Pro first!'`

#Save connection url to a variable for use with pandas `read_sql()` function.

`connection_url = f'mysql+pymysql://{user}:{password}@{host}/database_name'`

#Python function to read data from database into a DataFrame.

`pd.read_sql(sql_query, connection_url)`
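
Because the Codeup database requires credentials, here is a minimal sketch of the same `pd.read_sql()` pattern run against an in-memory SQLite database instead. The SQLite connection and the tiny `passengers` table are stand-ins invented for this example; only the shape of the call matches the lesson.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')

# Build a tiny, made-up passengers table to query.
sample = pd.DataFrame({
    'passenger_id': [0, 1, 2],
    'survived': [0, 1, 1],
    'pclass': [3, 1, 3],
})
sample.to_sql('passengers', conn, index=False)

# Same pattern as the lesson: a query string plus a connection.
sql_query = 'SELECT * FROM passengers'
demo_df = pd.read_sql(sql_query, conn)
print(demo_df.head(3))

conn.close()
```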

In [8]:
# Create sql query and save to variable.

sql_query = 'SELECT * FROM passengers'

In [9]:
# Create connection url and save to a variable.

connection_url = f'mysql+pymysql://{user}:{password}@{host}/titanic_db'

In [10]:
# Use my variables in the pandas read_sql() function.

titanic_df = pd.read_sql(sql_query, connection_url)
titanic_df.head(3)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1


## From Files
- **Create your DataFrame from a csv file.**

`df = pd.read_csv('file_path/file_name.csv')`

- **Create your DataFrame from an AWS S3 file.**

`df = pd.read_csv('https://s3.amazonaws.com/bucket_and_or_file_name.csv')`

- **Create your DataFrame from a Google sheet using its Share url.**

`sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'`

`csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')`

`df = pd.read_csv(csv_export_url)`

In [14]:
# Assign our Google Sheet share url to a variable.
# Make sure the sheet's share settings allow access through the link.

sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'

In [15]:
# Use the replace method to modify our Google Sheet share url to be a csv export url.
# The export url is the one we pass to read_csv() in the last step.
csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')

In [16]:
# Use read_csv() method to create our DataFrame.

df_googlesheet = pd.read_csv(csv_export_url)
df_googlesheet.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C


## From Your Clipboard
- **Read copy-pasted tabular data and parse it into a DataFrame.**

### Default
`df = pd.read_clipboard(sep='\\s+', **kwargs)`

### Some examples of options I have.
`columns = ['column_1', 'column_2', 'column_3']`

`df = pd.read_clipboard(sep=',', header=None, names=columns)`

For a short and sweet article that explains `read_clipboard()` nicely, see https://towardsdatascience.com/pandas-hacks-read-clipboard-94a05c031382

In [19]:
# Try out the read_clipboard() method here using the article.
pd.read_clipboard()
# This reads whatever is on your clipboard at that time.

Unnamed: 0,0,1,2,3
0,0.850004,0.206778,0.6552,0.079339
1,0.948567,0.749701,0.116241,0.069551
2,0.834722,0.360724,0.410327,0.535236
3,0.221309,0.916424,0.649175,0.80375


In [22]:
# Try out the read_clipboard() method with data without headers/column names.
pd.read_clipboard()
# The result is messy because the default separator is whitespace, not the commas
# used in this data. Passing sep=',' along with our own column names fixes that.
columns = [
    'PassengerId', 'Survived', 'Pclass', 'Name',
    'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
    'Cabin', 'Embarked']
df = pd.read_clipboard(sep=',', header=None, names=columns)
# sep sets the separator used to split each line into columns.
df
# Now the data is parsed cleanly.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
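
To see why the separator matters without depending on what is on your clipboard, here is a minimal sketch that uses `pd.read_csv()` on an in-memory string as a stand-in for `pd.read_clipboard()`. The sample text and column names are made up for this example.

```python
import io
import pandas as pd

# Made-up comma-separated text with no header row, standing in for clipboard contents.
text = '1,0,3,male\n2,1,1,female\n'
columns = ['PassengerId', 'Survived', 'Pclass', 'Sex']

# Splitting on whitespace finds no separators, so each line becomes one column.
messy = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None)
print(messy.shape)

# Splitting on commas and supplying names gives one column per field.
clean = pd.read_csv(io.StringIO(text), sep=',', header=None, names=columns)
print(clean.shape)
```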


## From an Excel Sheet

`pd.read_excel('your_excel_file_name.xlsx', sheet_name='your_table_name', usecols=['this_one', 'this_one'])`

In [23]:
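# Read in one sheet from the my_telco_churn Excel workbook.
# Note: read_excel() needs an Excel file. Pointing it at 'my_telco_churn.csv' raises
# XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'customer'
# The .xlsx file name below assumes the workbook is saved under that name.

customers_df = pd.read_excel('my_telco_churn.xlsx', sheet_name='Table2_CustDetails')
customers_df.head(3)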

## From Pydataset
Create your DataFrame using pydataset and read the dataset's documentation.

In [None]:
from pydataset import data

data('iris', show_doc=True)

df_iris = data('iris')

In [None]:
# Create DataFrame using pydataset 'iris'

df_iris = data('iris')
df_iris.head(3)

In [None]:
# Using Seaborn Datasets. This one has nice column names! :)

iris = sns.load_dataset('iris')
iris.head(3)

## Automating Data Acquisition
The process of acquiring, preparing, exploring, modeling, and evaluating data is called the Data Science Pipeline.
As we go through the pipeline, our goal is to end each stage with functions that automate the process and can feed into the next stage, making our work faster and, more importantly, repeatable.
We store our functions from each stage in modules, acquire.py, prepare.py, etc., and import them for use in our notebooks. All of the helper and main functions are stored in the .py file or module to keep our notebook clean and readable.
Ideally, upon completing the entire process, we should be able to use all of our functions, from each stage, to create one pipeline function that can reproduce our entire process from acquisition to evaluation.
If our goal is to acquire the titanic data from the Codeup database, both of the functions below would be stored in an acquire.py file and imported into our notebook for use.
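
As a rough sketch of the "one pipeline function" idea, the example below chains two stand-in stage functions. The names `acquire`, `prepare`, and `run_pipeline` and the hard-coded data are hypothetical; in practice each stage would live in its own module (acquire.py, prepare.py) and do real work.

```python
import pandas as pd

def acquire():
    """Stand-in for an acquire.py function; returns a tiny hard-coded DataFrame."""
    return pd.DataFrame({'age': [22.0, None, 26.0], 'survived': [0, 1, 1]})

def prepare(df):
    """Stand-in for a prepare.py function; drops rows with missing values."""
    return df.dropna().reset_index(drop=True)

def run_pipeline():
    """Chain the stages so the whole process can be rerun with one call."""
    return prepare(acquire())

pipeline_df = run_pipeline()
print(pipeline_df)
```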

In [None]:
# Create helper function to get the necessary connection url.
def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

# Use the above helper function and a sql query in a single function.
def get_db_data():
    '''
    This function reads data from the Codeup db into a df.
    '''
    sql_query = 'write your sql query here; test it in Sequel Pro first!'
    return pd.read_sql(sql_query, get_connection('database_name'))

In [None]:
# Let's create a helper function that creates our connection url.

def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

In [None]:
def new_titanic_data():
    '''
    This function reads in the titanic data from the Codeup db
    and returns a pandas DataFrame with all columns.
    '''
    sql_query = 'SELECT * FROM passengers'
    return pd.read_sql(sql_query, get_connection('titanic_db'))

## Caching Data

Caching or storing data you've retrieved from a database or website makes accessing it later much faster. Basically, cached data reduces load times.

We can design our acquire functions to get our data for us faster by reading in a csv file, if one exists, and if not, acquiring our data and creating a csv file for later use.

The os.path.isfile() method in Python is used to check whether a specified path is an existing file or not. It returns a boolean value.

In [None]:
# Let's check to see if a file named 'titanic_df.csv' exists in this directory.
os.path.isfile('titanic_df.csv')

In [None]:
# Let's write our 'titanic_df' DataFrame to a csv file.

titanic_df.to_csv('titanic_df.csv')

In [None]:
# Let's check again...

os.path.isfile('titanic_df.csv')

Let's use this concept to write a new function that allows us to hit the Codeup database, write the data to a csv file for later use, and read the data into a pandas DataFrame the next time we call the function and the csv file exists.

In [None]:
# Here is our first helper function that's used below.

def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

In [None]:
# Let's use our new_titanic_data() function from above as a helper in a final function.

def new_titanic_data():
    '''
    This function reads the titanic data from the Codeup db into a df
    and returns the df.
    '''
    # Create SQL query.
    sql_query = 'SELECT * FROM passengers'
    
    # Read in DataFrame from Codeup db.
    df = pd.read_sql(sql_query, get_connection('titanic_db'))
    
    return df

In [None]:
def get_titanic_data(cached=False):
    '''
    This function reads in titanic data from the Codeup database and writes it to
    a csv file if cached == False or the csv file does not exist. Otherwise, it
    reads the titanic df from the csv file. Returns the df.
    '''
    if not cached or not os.path.isfile('titanic_df.csv'):

        # Read fresh data from db into a DataFrame.
        df = new_titanic_data()

        # Write DataFrame to a csv file.
        df.to_csv('titanic_df.csv')

    else:

        # If cached == True and the csv file exists, read in data from csv.
        df = pd.read_csv('titanic_df.csv', index_col=0)

    return df

In [None]:
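# The default is cached=False, so this queries the database and rewrites the csv.
df = get_titanic_data()
df.head(2)

In [None]:
# Passing cached=True reads from titanic_df.csv when that file exists.
df = get_titanic_data(cached=True)
df.head(2)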
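
The `index_col=0` argument in `get_titanic_data()` matters because `to_csv()` writes the DataFrame index as the first column by default. Here is a minimal sketch of that round trip, using an in-memory string instead of a file; the sample data is made up.

```python
import io
import pandas as pd

sample = pd.DataFrame({'survived': [0, 1], 'pclass': [3, 1]})

# to_csv() writes the index as an unnamed first column by default.
csv_text = sample.to_csv()

# Without index_col, that column comes back as data named 'Unnamed: 0'.
extra_col = pd.read_csv(io.StringIO(csv_text))
print(extra_col.columns.tolist())

# With index_col=0, the first column is restored as the index.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(round_trip.columns.tolist())
```

An alternative is to skip writing the index at all with `to_csv(filename, index=False)` and then read the file without `index_col`.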

### Goals
Data you wish to use in analysis will be stored in a variety of sources. In this lesson, we will review importing data from a csv and via MySQL, and we will also learn how to import data from our local clipboard, a Google Sheets document, and an MS Excel file. We will then select one source to use as we continue through the rest of this module.

## Methods of Data Acquisition
- `read_clipboard`: When you have data copied to your clipboard, you can use pandas to read it into a data frame with `pd.read_clipboard`. This can be useful for quickly transferring data to/from a spreadsheet.
- `read_excel`: This function can be used to create a data frame based on the contents of an Excel spreadsheet.
- `read_csv`: Read from a local csv, or from the cloud (Google Sheets or AWS S3).
- `read_sql(sql_query, connection_url)`: Read data using a SQL query to a database. You must have the required drivers installed, and a specially formatted url string must be provided.

- To talk to a MySQL database, install the drivers:

`python -m pip install pymysql mysql-connector`

- The connection url string:

`mysql+pymysql://USER:PASSWORD@HOST/DATABASE_NAME`
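
One thing to watch for with the connection url string: a password containing characters such as `@` or `/` can break the url unless it is URL-encoded. Below is a minimal sketch of a helper that encodes the password with the standard library's `urllib.parse.quote_plus`. The function name `build_connection_url` and the sample credentials are hypothetical, not part of the lesson's acquire.py.

```python
from urllib.parse import quote_plus

def build_connection_url(db, user, password, host):
    """Build a mysql+pymysql connection url, URL-encoding the password."""
    return f'mysql+pymysql://{user}:{quote_plus(password)}@{host}/{db}'

# Made-up credentials, only to show the encoding.
demo_url = build_connection_url('titanic_db', user='u', password='p@ss/word', host='h')
print(demo_url)
```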

## Source: A Shared Google Sheet
1. Get the shareable link url: https://docs.google.com/spreadsheets/d/BLAHBLAHBLAH/edit#gid=NUMBER
2. Turn that into a CSV export URL: replace `/edit#gid=` with `/export?format=csv&gid=`, which gives https://docs.google.com/spreadsheets/d/BLAHBLAHBLAH/export?format=csv&gid=NUMBER
3. Pass it to pd.read_csv, which can take a URL.
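
The steps above can be wrapped in a small helper. This is a sketch, not a complete solution: the function name `to_csv_export_url` is hypothetical, and falling back to `gid=0` when the share url has no `gid` is an assumption about how the first tab is numbered.

```python
import re

def to_csv_export_url(sheet_url):
    """Convert a Google Sheets share url into a csv export url."""
    sheet_id = re.search(r'/spreadsheets/d/([^/]+)', sheet_url)
    if sheet_id is None:
        raise ValueError('Not a Google Sheets url: ' + sheet_url)
    # Assumption: fall back to gid=0 when the url does not name a tab.
    gid = re.search(r'gid=(\d+)', sheet_url)
    gid_value = gid.group(1) if gid else '0'
    return (f'https://docs.google.com/spreadsheets/d/{sheet_id.group(1)}'
            f'/export?format=csv&gid={gid_value}')

sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'
csv_export_url = to_csv_export_url(sheet_url)
print(csv_export_url)
```

Compared with a plain `str.replace`, extracting the sheet id and `gid` separately also copes with share urls that carry extra query parameters.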

In [1]:
import pandas as pd

sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'    

csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')

df_googlesheet = pd.read_csv(csv_export_url)
df_googlesheet.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Source: AWS S3

In [2]:
# If the S3 file is private, you will need your S3 configuration set up properly.
df_s3 = pd.read_csv('https://s3.amazonaws.com/irs-form-990/index_2011.csv')
df_s3.head()

Unnamed: 0,RETURN_ID,FILING_TYPE,EIN,TAX_PERIOD,SUB_DATE,TAXPAYER_NAME,RETURN_TYPE,DLN,OBJECT_ID
0,9091250,EFILE,591971002,201009,11/30/2011 1:06:39 AM,ANGELUS INC,990,93493316003251,201103169349300325
1,9091274,EFILE,251713602,201106,11/30/2011 1:09:14 AM,TOUCH-STONE SOLUTIONS INC,990,93493313012311,201113139349301231
2,9091275,EFILE,232705170,201012,11/30/2011 1:09:16 AM,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA ...,990,93493313013011,201113139349301301
3,9091276,EFILE,581805618,201106,11/30/2011 1:09:19 AM,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK T...,990,93493313013111,201113139349301311
4,9091277,EFILE,581876019,201106,11/30/2011 1:09:21 AM,HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR,990,93493313013161,201113139349301316


## Source: SQL
Create a dataframe from the passengers table in the MySQL database, titanic_db.

#### Database Credentials

It's a bad idea to store your database access credentials (i.e. your username and password) in plaintext in your source code. There are many different ways one could manage secrets like this, but a simple way is to store the values in a python file that is not included along with the rest of your source code. This is what we have done with the env module.
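
For reference, an env module like the one described here might look like the sketch below. The variable names `host`, `user`, and `password` match what the lesson imports; the values are placeholders, and listing the file in `.gitignore` is one common way (an assumption here, not something the lesson prescribes) to keep it out of a shared repository.

```python
# env.py
# Keep this file out of version control, for example by listing it in .gitignore.
# The values below are placeholders, not real credentials.
host = 'your.database.host'
user = 'your_username'
password = 'your_password'
```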

In [3]:
import env

def get_connection(db, user=env.user, host=env.host, password=env.password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

df = pd.read_sql('SELECT * FROM passengers', get_connection('titanic_db'))

df.head()

ModuleNotFoundError: No module named 'env'

We will create a function that we can reference later to acquire the data:

In [None]:
def get_titanic_data():
    return pd.read_sql('SELECT * FROM passengers', get_connection('titanic_db'))

We'll store this function in a file named acquire.py.

## Caching Your Data
Because data acquisition can take time, it's a common practice to write the data locally to a `.csv` file.

1. Do whatever you need to do to produce the dataframe that you need.
    - For example, `df = pd.read_sql('SELECT * FROM passengers', get_connection('titanic_db'))`
    - Or your dataframe could include joins, multiple data sources, etc...
2. Now use `df.to_csv("titanic.csv")` to write that dataframe to the file.
3. Now that you've written the csv file, you can use it later in other parts of your pipeline!
4. Consider the following function:

import os

def get_titanic_data():
    filename = "titanic.csv"

    if os.path.isfile(filename):
        # The csv was written with its index, so read that first column back as the index.
        return pd.read_csv(filename, index_col=0)
    else:
        # read the SQL query into a dataframe
        df = pd.read_sql('SELECT * FROM passengers', get_connection('titanic_db'))

        # Write that dataframe to disk for later. Called "caching" the data for later.
        df.to_csv(filename)

        # Return the dataframe to the calling code
        return df
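
The same check-then-read-or-fetch pattern can be tried out without database access. The sketch below generalizes it into a helper that takes any function producing a DataFrame; `cached_csv` and `slow_fetch` are hypothetical names, and the temporary directory is only there so the example cleans up after itself.

```python
import os
import tempfile
import pandas as pd

def cached_csv(filename, fetch):
    """Return the DataFrame stored in filename if it exists; otherwise call
    fetch(), cache its result to filename, and return it."""
    if os.path.isfile(filename):
        return pd.read_csv(filename, index_col=0)
    df = fetch()
    df.to_csv(filename)
    return df

calls = []

def slow_fetch():
    """Stand-in for a slow database query; records each call."""
    calls.append(1)
    return pd.DataFrame({'survived': [0, 1], 'pclass': [3, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')
    first = cached_csv(path, slow_fetch)    # runs slow_fetch and writes the csv
    second = cached_csv(path, slow_fetch)   # reads the csv; slow_fetch is skipped

print(len(calls))
```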