# Lesson: Classification Data Acquisition

Plan --> <b>Acquire</b> --> Prepare --> Explore --> Model --> Deliver

<hr style="border:2px solid gray">

### Goals:
At the end of this lesson, you will know...

- How to read data from a <b>CSV</b> file using ```read_csv```. The file can be stored locally, exported from a Google Sheet, or hosted in Amazon Web Services (AWS) S3.
<br>

- How to read data from your <b>local clipboard</b> using ```pandas.read_clipboard```. This can be useful for quickly transferring data to/from a spreadsheet.
<br>

- How to read data from <b>Microsoft Excel</b> using ```read_excel```.
<br>

- How to read data from a MySQL server using ```read_sql(sql_query, connection_url)```, which runs a SQL query against a database. You must have the required drivers installed and provide a specially formatted URL string.
<br>

- How, when and why to cache data locally

<hr style="border:1.5px solid black">

### Methods of Data Acquisition

```read_clipboard```: When you have data copied to your clipboard, you can read it into a dataframe with ```pd.read_clipboard```. This can be useful for quickly transferring data to/from a spreadsheet.

<br>

```read_excel```: This function can be used to create a data frame based on the contents of an Excel spreadsheet.

<br>

```read_csv```: Read from a local CSV file, or from the cloud (Google Sheets or AWS S3).

<br>

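```read_sql(sql_query, connection_url)```: Read data by running a SQL query against a database. You must have the required drivers installed, and a specially formatted URL string must be provided. When the connection is given as a URL string, pandas relies on SQLAlchemy to open it, so SQLAlchemy must be installed as well.

To talk to a MySQL database: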
<br>

```` python -m pip install pymysql mysql-connector sqlalchemy````

<br>

The connection url string:
<br>

```` mysql+pymysql://USER:PASSWORD@HOST/DATABASE_NAME````
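
If the password contains characters such as ```@``` or ```/```, they need to be percent-encoded before they go into the URL. The sketch below assembles the string with a hypothetical helper, ```build_mysql_url``` (it is not part of pandas or pymysql), using ```quote_plus``` from the standard library. The credentials and host are made up for illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, database):
    """Return a connection URL in the mysql+pymysql format shown above."""
    # Percent-encode the credentials so characters like '@' or '/'
    # are not mistaken for URL delimiters.
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mysql+pymysql://{user}:{pwd}@{host}/{database}"


# Made-up credentials, for illustration only.
example_url = build_mysql_url("user", "p@ss/word", "db.example.com", "titanic_db")
print(example_url)
# mysql+pymysql://user:p%40ss%2Fword@db.example.com/titanic_db
```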

<hr style="border:1.5px solid black">

### Source: Clipboard

In [1]:
import pandas as pd

In [2]:
# reads data copied to your clipboard
df1 = pd.read_clipboard()
df1

Unnamed: 0,At,the,end,of,this,"lesson,",you,will,know...


<hr style="border:1.5px solid black">

### Source: A Shared Google Sheet
1. Get the shareable link URL: https://docs.google.com/spreadsheets/d/BLAHBLAHBLAH/edit#gid=NUMBER

2. Turn that into a CSV export URL: replace ```/edit#gid=``` with ```/export?format=csv&gid=```, so that ```format=csv``` leads the query string: https://docs.google.com/spreadsheets/d/BLAHBLAHBLAH/export?format=csv&gid=NUMBER

3. Pass it to ```pd.read_csv```, which accepts a URL.

In [3]:
sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'    

csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
csv_export_url

'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/export?format=csv&gid=341089357'

In [4]:
df_googlesheet = pd.read_csv(csv_export_url)
df_googlesheet.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
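
The string replacement above only works when the link ends in ```/edit#gid=NUMBER```. As a slightly more forgiving sketch, the hypothetical helper below (```sheet_csv_export_url``` is not a pandas or Google function) pulls the sheet ID and ```gid``` out with regular expressions. Falling back to ```gid=0``` when no ```gid``` is present is an assumption, not something the lesson specifies.

```python
import re


def sheet_csv_export_url(share_url):
    """Build a CSV export URL from a Google Sheets share link."""
    # The sheet ID is the path segment that follows /spreadsheets/d/
    id_match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", share_url)
    if id_match is None:
        raise ValueError("not a Google Sheets URL: " + share_url)

    # The gid may appear in the fragment (#gid=) or the query string (?gid= / &gid=)
    gid_match = re.search(r"[#&?]gid=(\d+)", share_url)
    gid = gid_match.group(1) if gid_match else "0"  # assumption: default to gid 0

    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{id_match.group(1)}/export?format=csv&gid={gid}"
    )


sheet_url = "https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357"
print(sheet_csv_export_url(sheet_url))
```

The result can be passed to ```pd.read_csv``` exactly as in the cell above.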


<hr style="border:1.5px solid black">

### Source: CSV (Hosted or Local)

In [5]:
url = "https://gist.githubusercontent.com/ryanorsinger/bec2f59a9cef8ae7428cb70b3541354a/raw/ef64298da52e5d70f4d388f5fd48eccdb02ed3f1/ice_cream.csv"

df = pd.read_csv(url)
df.head()

Unnamed: 0,flavor,pints
0,moolenium crunch,11.05757
1,bubblegum,6.288724
2,chubby hubby,7.660815
3,bubblegum,6.644338
4,neopolitan,13.600125
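
The cell above reads a hosted file. A local path works the same way. The sketch below writes a few of the rows shown above to a temporary file (the file name ```ice_cream.csv``` and the temporary directory are assumptions for illustration) and reads them back with ```read_csv```.

```python
import os
import tempfile

import pandas as pd

# A few rows copied from the output above, written out as CSV text.
csv_text = (
    "flavor,pints\n"
    "moolenium crunch,11.05757\n"
    "bubblegum,6.288724\n"
    "chubby hubby,7.660815\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "ice_cream.csv")
    with open(local_path, "w") as f:
        f.write(csv_text)

    # read_csv accepts a local file path just like it accepts a URL
    local_df = pd.read_csv(local_path)

print(local_df)
```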


<hr style="border:1.5px solid black">

### Source: SQL
Create a dataframe from the ```passengers``` table in the MySQL database ```titanic_db```.

<div class="alert alert-danger" role="alert">
    <div class="row vertical-align">
        <div class="col-xs-1 text-center">
            <i class="fa fa-exclamation-triangle fa-2x"></i>
        </div>
        <div class="col-xs-11">
                <strong>Database Credentials</strong>
        <br>
        It's a bad idea to store your database access credentials (i.e. your username and password) in plaintext in your source code. There are many different ways to manage secrets like this, but a simple one is to store the values in a Python file that is not included along with the rest of your source code. This is what we have done with the env module.
        </div>
    </div>
</div>


<div class="alert alert-danger" role="alert">
    <div class="row vertical-align">
        <div class="col-xs-1 text-center">
            <i class="fa fa-exclamation-triangle fa-2x"></i>
        </div>
        <div class="col-xs-11">
                <strong> Remember:</strong>
            Be sure to add env.py to your .gitignore file before you commit or push, so that your credentials never leave your machine.
        </div>
    </div>
</div>

In [6]:
import env
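
The ```env``` module imported above is just a Python file that defines the three names used in the next cell. A sketch of what ```env.py``` might contain, with placeholder values that you would replace with your own credentials:

```python
# env.py (hypothetical contents; the values below are placeholders)
# Keep this file out of version control by listing env.py in .gitignore.
username = "your_username"
password = "your_password"
host = "your.database.host"
```

Adding a line that reads ```env.py``` to your ```.gitignore``` file keeps it out of your commits.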

In [7]:
def get_db_url(db, username=env.username, host=env.host, password=env.password):
    return f'mysql+pymysql://{username}:{password}@{host}/{db}'

In [8]:
df = pd.read_sql('SELECT * FROM passengers', get_db_url('titanic_db'))

In [9]:
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1
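
To experiment with ```read_sql``` without access to a MySQL server, the sketch below swaps in an in-memory SQLite database from the standard library. This is a stand-in for practice, not the lesson's MySQL setup. The table holds three rows and four columns copied from the output above.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the MySQL server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE passengers "
    "(passenger_id INTEGER, survived INTEGER, pclass INTEGER, sex TEXT)"
)
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?, ?, ?)",
    [(0, 0, 3, "male"), (1, 1, 1, "female"), (2, 1, 3, "female")],
)
conn.commit()

# read_sql runs the query and returns the result as a dataframe
survivors = pd.read_sql("SELECT * FROM passengers WHERE survived = 1", conn)
conn.close()

print(survivors)
```

Because the query runs inside the database, filtering with ```WHERE``` means only the matching rows are transferred into pandas.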


We will create a function that we can call later to acquire the data:

In [10]:
def new_titanic_data():
    url = get_db_url('titanic_db')
    
    return pd.read_sql('SELECT * FROM passengers', url)

In [11]:
# acquire new data:
new_titanic_data()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.9250,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1000,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.0500,S,Third,,Southampton,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,,Southampton,1
887,887,1,1,female,19.0,0,0,30.0000,S,First,B,Southampton,1
888,888,0,3,female,,1,2,23.4500,S,Third,,Southampton,0
889,889,1,1,male,26.0,0,0,30.0000,C,First,C,Cherbourg,1


We'll store this function in a file named ```acquire.py```.

<hr style="border:1.5px solid black">

### Caching Your Data
Because data acquisition can take time, it's a common practice to write the data locally to a .csv file.

1. Do whatever you need to do to produce the dataframe that you need.
    - For example ```df = pd.read_sql('SELECT * FROM passengers', get_db_url('titanic_db'))```
    - Or your dataframe could include joins, multiple data sources, etc.
    
<br>

2. Now use ```df.to_csv("titanic.csv", index=False)``` to write that dataframe to the file. Passing ```index=False``` keeps the row index from being written as an extra column.
<br>

3. Now that you've written the csv file, you can use it later in other parts of your pipeline!
<br>

4. Consider the following function:

In [12]:
import os

def get_titanic_data():
    filename = "titanic.csv"
    
    # if file is available locally, read it
    if os.path.isfile(filename):
        return pd.read_csv(filename)
    
    # if file not available locally, acquire data from SQL database
    # and write it as csv locally for future use
    else:
        # read the SQL query into a dataframe
        df = new_titanic_data()
        
        # Write the dataframe to disk ("caching" it) for later use.
        # index=False keeps the row index from coming back as an extra column.
        df.to_csv(filename, index=False)

        # Return the dataframe to the calling code
        return df  
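
The same pattern can be generalized to any data source. The sketch below assumes a hypothetical helper named ```get_cached_data``` that takes any zero-argument function returning a dataframe, plus a ```refresh``` flag for when the cached copy may be stale. Neither name comes from pandas.

```python
import os

import pandas as pd


def get_cached_data(filename, fetch, refresh=False):
    """Return the dataframe cached in `filename`, calling `fetch` only when needed."""
    # Use the local copy if it exists and a refresh was not requested
    if os.path.isfile(filename) and not refresh:
        return pd.read_csv(filename)

    # Otherwise acquire the data and cache it for next time
    df = fetch()
    df.to_csv(filename, index=False)
    return df


# Hypothetical usage with the acquisition function defined earlier:
# df = get_cached_data("titanic.csv", new_titanic_data)
# df = get_cached_data("titanic.csv", new_titanic_data, refresh=True)
```

Calling it with ```refresh=True``` re-runs the query and overwrites the file, which is one way to handle a source table that changes over time.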

In [13]:
df = get_titanic_data()
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1
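
One detail worth checking when you cache to CSV: by default ```to_csv``` writes the row index as an unnamed first column, and ```read_csv``` then loads it back as a column called ```Unnamed: 0```. The sketch below uses a tiny made-up dataframe and temporary files (both are assumptions for illustration) to compare the default with ```index=False```.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in dataframe, for illustration only
df = pd.DataFrame({"passenger_id": [0, 1], "survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    df.to_csv(with_index)                  # default: index is written
    df.to_csv(without_index, index=False)  # index is left out

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'passenger_id', 'survived']
print(cols_without_index)  # ['passenger_id', 'survived']
```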
