<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

In [None]:
import pandas as pd
import numpy as np
import datetime

pd.options.display.float_format = '{:,.2f}'.format

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Read data from a `.csv` file into a DataFrame

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv"
)

In [None]:
data

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

**NOTE:**
* pandas provides many other reader functions, including:
    * `read_excel` - reads Excel files (you can specify which sheet)
    * `read_fwf` - reads fixed-width files
    * `read_json` - reads data from JSON files
    * `read_parquet` - reads Parquet files (column-based files, often used in data lakes)
    * `read_sql` - reads data from a SQL database (you'll have to specify a connection object and a SQL query)
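
As a rough illustration of two of the readers listed above, the sketch below parses small in-memory strings instead of real files. The sample text and the column names (`id`, `name`, `score`) are made up for this example, and `io.StringIO` stands in for a file path or URL.

```python
import io
import pandas as pd

# Hypothetical fixed-width text: fields are padded to 5, 10 and 5 characters.
fwf_text = (
    f"{'id':<5}{'name':<10}{'score':<5}\n"
    f"{'1':<5}{'Ana':<10}{'9.5':<5}\n"
    f"{'2':<5}{'Bob':<10}{'7.0':<5}\n"
)
# widths tells read_fwf how many characters each column occupies.
fwf_df = pd.read_fwf(io.StringIO(fwf_text), widths=[5, 10, 5])

# Hypothetical JSON text: a list of records, one object per row.
json_text = '[{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bob"}]'
json_df = pd.read_json(io.StringIO(json_text))

print(fwf_df)
print(json_df)
```

With real data, you would pass a file path or URL in place of the `io.StringIO` object.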

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Read only specific columns from a `.csv` file into a DataFrame

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    usecols=["Customer Id", "Helpfulness", "Rep Id", "Date"]
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Specify custom column names when reading a `.csv` file into a DataFrame

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample_no_header.csv", 
    header=None,
    names=["Col 1", "Col 2", "Col 3",
          "Col 4", "Col 5", "Col 6",
          "Col 7", "Col 8", "Col 9"]
)

In [None]:
data

<br><br>

### You can still read only the columns you want

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    header=0,
    usecols=[0, 2, 7, 8],
    names=["Col 1", "Col 3", "Col 8", "Col 9"]
)

In [None]:
data

<br><br>

### You can also use this technique if you want to overwrite the existing column names

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    header=0,
    names=["Col 1", "Col 2", "Col 3",
          "Col 4", "Col 5", "Col 6",
          "Col 7", "Col 8", "Col 9"]
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Read only a specified number of rows from a `.csv` file into a DataFrame

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    nrows=7
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Specify which column(s) to use as the row index

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    index_col='Customer Id'
)

In [None]:
data
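
The heading mentions column(s): `index_col` also accepts a list, in which case pandas builds a multi-level row index. The sketch below assumes a tiny made-up dataset that borrows the column names `Customer Id`, `Rep Id` and `Helpfulness` from the survey file; the values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey file (values are invented).
csv_text = (
    "Customer Id,Rep Id,Helpfulness\n"
    "101,7,4\n"
    "101,9,5\n"
    "102,7,3\n"
)

# A list in index_col builds a MultiIndex from the listed columns.
indexed = pd.read_csv(
    io.StringIO(csv_text),
    index_col=["Customer Id", "Rep Id"]
)

print(indexed)

# Look up one row by its (Customer Id, Rep Id) pair.
print(indexed.loc[(101, 9), "Helpfulness"])
```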

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Specify how many rows to skip at the start of the file

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample_skip_rows.csv", 
    skiprows=4
)

In [None]:
data
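
To see what `skiprows` does without downloading anything, here is a minimal sketch on an in-memory string. The four leading lines are invented export notes, not the actual contents of `survey_sample_skip_rows.csv`.

```python
import io
import pandas as pd

# Hypothetical file: four lines of notes come before the real header row.
csv_text = (
    "Survey export\n"
    "Generated by an internal tool\n"
    "For training use only\n"
    "Data starts below\n"
    "Customer Id,Helpfulness\n"
    "101,4\n"
    "102,5\n"
)

# skiprows=4 drops the first four lines; the fifth line becomes the header.
data_demo = pd.read_csv(io.StringIO(csv_text), skiprows=4)

print(data_demo)
```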

<br><br>

<br><br>

## Read data directly from a database (AWS Redshift, PostgreSQL, AWS Aurora, Microsoft SQL Server, Oracle, etc.)

* use the pandas `read_sql()` function
<br><br>
* you just need to pass a SQL connection object and the SQL code you want to run
<br><br>
* to create a SQL connection, try **SQLAlchemy**:
  * `pip install sqlalchemy`
  * for PostgreSQL, you also need the **psycopg2** driver: `pip install psycopg2`
  * https://www.sqlalchemy.org/
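
The pattern described above (a connection object plus the SQL text) can be tried offline. The sketch below swaps in Python's built-in `sqlite3` module and an in-memory database instead of SQLAlchemy and a real server; the `visits` table and its contents are invented for this example.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for a real database server.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE visits (patient_id INTEGER, ward TEXT)")
conn_demo.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "A")],
)
conn_demo.commit()

# Pass the SQL text and the connection object to read_sql().
visits = pd.read_sql(
    "SELECT * FROM visits WHERE ward = 'A';",
    conn_demo
)
conn_demo.close()

print(visits)
```

For the other databases named in the heading, the call to `read_sql()` looks the same; only the connection object changes.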

### Sample database

* PostgreSQL database
* contains `hospital_data` table
* endpoint: `training-data.choqmyihntw0.us-east-1.rds.amazonaws.com`
* read-only user / pwd: `student` / `yus6UMCgVEpEY4Q79BZ`
* port: `5432`
* database name: `postgres`

**We'll use this data in one of the projects!**

In [None]:
from sqlalchemy import create_engine

In [None]:
conn = create_engine('postgresql://student:yus6UMCgVEpEY4Q79BZ@training-data.choqmyihntw0.us-east-1.rds.amazonaws.com:5432/postgres')

In [None]:
conn

In [None]:
hospital_data = pd.read_sql(
    'SELECT * FROM hospital_data;', 
    conn)

In [None]:
hospital_data

### CAUTION:

* **NEVER, EVER store credentials in code!**
  * I use a plain-text user / pwd here for simplicity. **DO NOT DO THIS IN PRODUCTION!**
  * for code deployed on AWS, you can use temporary permissions - slightly more advanced topic
    * an example here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAMDBAuth.Connecting.Python.html
  * once you create the connection object, just pass that to `read_sql()` as you normally would

* use roles / users to get the minimum required permissions to perform the task!

* **very few people should ever have write permission on production systems!**
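
One common way to keep credentials out of code is to read them from environment variables. The sketch below is only one possible approach, not the method used in this notebook: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the helper `build_postgres_url` are hypothetical, and the demo uses a plain dict with fake values instead of the real environment.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (names are hypothetical)."""
    user = quote_plus(env["DB_USER"])
    # URL-encode the password so characters such as '@' or ':' do not break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Demonstration with a plain dict and fake values instead of the real environment.
demo_env = {
    "DB_USER": "student",
    "DB_PASSWORD": "p@ss:word",
    "DB_HOST": "db.example.com",
}
demo_url = build_postgres_url(demo_env)
print(demo_url)
```

In real code you would call `build_postgres_url()` with no arguments, pass the result to `create_engine()`, and avoid printing the URL, since it contains the password.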

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Read from Excel files

* use the pandas `read_excel` function
* can specify which sheet you want to import
* lots of optional arguments to accommodate a variety of scenarios

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

The code below reads the sheet called `Departments` of the `survey_mappings.xlsx` file into a new dataframe called `departments`. 

In [None]:
departments = pd.read_excel(
    'https://edlitera-datasets.s3.amazonaws.com/survey_mappings.xlsx', 
    'Departments'
)

departments

<br><br>

The code below reads the sheet called `Reps` of the `survey_mappings.xlsx` file into a new dataframe called `reps`.

In [None]:
reps = pd.read_excel(
    'https://edlitera-datasets.s3.amazonaws.com/survey_mappings.xlsx', 
    'Reps'
)

reps

<br><br>

The code below reads the first sheet of the `customer_ltv.xlsx` file into a new dataframe called `ltv`. In addition, we tell pandas that numbers use `,` to separate thousands. 

**NOTE:** Sometimes, it is necessary to tell pandas that numbers use `,` to separate thousands. This allows it to properly parse numbers. **Only necessary for columns stored as TEXT in Excel.**
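
The effect of `thousands` is easiest to see side by side. The sketch below uses `read_csv`, which accepts the same `thousands` parameter, on a small in-memory string so that it runs without an Excel file; the `customer` and `ltv` values are invented.

```python
import io
import pandas as pd

# Hypothetical data: amounts are quoted text that uses ',' as the thousands separator.
csv_text = 'customer,ltv\nA,"1,250"\nB,"12,000"\n'

# Without thousands=',' the amounts stay as text.
as_text = pd.read_csv(io.StringIO(csv_text))

# With thousands=',' the amounts are parsed as numbers.
as_numbers = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(as_text["ltv"].tolist())
print(as_numbers["ltv"].tolist())
```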

In [None]:
ltv = pd.read_excel(
    'https://edlitera-datasets.s3.amazonaws.com/customer_ltv.xlsx',
    thousands=','
)

ltv

In [None]:
ltv.info()

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Read only specific columns from an Excel file into a DataFrame

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.xlsx", 
    usecols=["Customer Id", "Helpfulness", "Rep Id", "Date"]
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Specify custom column names when reading an Excel file into a DataFrame

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample_no_header.xlsx", 
    header=None,
    names=["Col 1", "Col 2", "Col 3",
          "Col 4", "Col 5", "Col 6",
          "Col 7", "Col 8", "Col 9"]
)

In [None]:
data

<br><br>

### You can still read only the columns you want

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.xlsx", 
    header=0,
    usecols=[0, 2, 7, 8],
    names=["Col 1", "Col 3", "Col 8", "Col 9"]
)

In [None]:
data

<br><br>

### You can also use this technique if you want to overwrite the existing column names

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.xlsx", 
    header=0,
    names=["Col 1", "Col 2", "Col 3",
          "Col 4", "Col 5", "Col 6",
          "Col 7", "Col 8", "Col 9"]
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Read only a specified number of rows from an Excel file into a DataFrame

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.xlsx", 
    nrows=7
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Specify which column(s) to use as the row index

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.xlsx", 
    index_col='Customer Id'
)

In [None]:
data

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Specify how many rows to skip at the start of the file

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample_skip_rows.xlsx", 
    skiprows=4
)

In [None]:
data

<br><br><br><br>

## Parse dates when reading data into a DataFrame

**NOTE:** pandas stores parsed dates as datetime values, which always include a time component. If the source specifies no time, the time is set to `00:00:00`.
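
The note above can be checked on a small in-memory example before running it on the survey file. The two rows below are invented; only the column name `Date` matches the real dataset.

```python
import io
import pandas as pd

# Hypothetical data: dates with no time component.
csv_text = "Customer Id,Date\n101,2021-03-05\n102,2021-03-06\n"

parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# The column now holds datetime values rather than text.
print(parsed["Date"].dtype)

# The missing time is filled in as midnight.
first = parsed["Date"].iloc[0]
print(first, first.hour, first.minute, first.second)
```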

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

In [None]:
data

<br><br>

In [None]:
data = pd.read_excel(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.xlsx", 
    parse_dates=['Date']
)

In [None]:
data