![pandas](../images/pandas_logo.png)
# Loading Data into Python with Pandas

![sleepy panda](../images/panda-sleep-2.jpg)

One of the fundamental concepts across Python, R, and databases is the tabular data structure, especially with named columns and records as rows.
In Python, we use the Pandas library and the DataFrame data structure.
In R, you will learn about the native data frame type.
In SQL, the `table` is the fundamental data storage concept.

In each of these cases, there are a variety of methods to perform operations down a column of data. 
Additionally, you can subset the data by selecting a list of columns and by keeping only the rows that pass a boolean (True/False) test of a condition.
Even more concepts are possible, as you will see, such as grouping rows for analytics and other operations.
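
As a rough preview of these three ideas, here is a minimal sketch on a tiny hand-made table. The column names and values are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Tiny made-up table; the values are illustrative only
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "cylinders": [4, 4, 8, 8],
    "mpg": [30.0, 28.0, 15.0, 17.0],
})

# Operation down a column: the mean of every mpg value
mean_mpg = df["mpg"].mean()

# Subset: keep only rows passing a boolean test, then keep two columns
efficient = df[df["mpg"] > 20][["name", "mpg"]]

# Grouping: average mpg for each cylinder count
mpg_by_cyl = df.groupby("cylinders")["mpg"].mean()

print(mean_mpg)
print(efficient)
print(mpg_by_cyl)
```

Each of these operations is covered in more detail later in this notebook or in the course.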

Python has an extensive set of libraries that make it easier to manipulate data.
**Pandas** is one such package that is shown here and used extensively within this course.
This notebook illustrates the basics of reading a file into a **Pandas DataFrame**.
Not only does `Pandas` assist in data loading, but the _data frame_ concept is the preferred structure of data for numerous computational libraries that are part of the extensive Python ecosystem.

----

### Load a csv file using Pandas. 

----

**Reference:** [Loading a CSV into Pandas](http://chrisalbon.com/python/pandas_dataframe_importing_csv.html)


**There are two ways of reading a Comma Separated Values (CSV) file into a DataFrame:**

* `pd.read_csv(path, index_col=0, parse_dates=True)`

* `pd.DataFrame.from_csv(path)`

#### Method 1
The first one is the preferred way of reading a CSV file.

In [None]:
import pandas as pd

with open('/dsa/data/all_datasets/auto-mpg/auto-mpg.csv', 'r') as file:
    df1 = pd.read_csv(file)

It's that simple! <br><br>

The simple code block above does three things:

 1. Loads the `pandas` library, giving it the alias `pd`.
 2. Opens the file `/dsa/data/all_datasets/auto-mpg/auto-mpg.csv`.
    * The path starts at the root of the file system and descends one folder per "/" until it reaches the `auto-mpg/` folder.
    * The open file is then referred to by the variable `file`.
 3. Reads the data:
    1. Using the pandas library,
    1. Interpreted as a CSV formatted file, and
    1. Stored as a `DataFrame` in a variable named `df1`

Actually, it can be even simpler. `read_csv()` also accepts a file path directly, so the single line of code below produces the same DataFrame.

```python
df1 = pd.read_csv('/dsa/data/all_datasets/auto-mpg/auto-mpg.csv')
```
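
If the course data file is not available, the same call can be tried on in-memory text. The sketch below assumes a few made-up rows wrapped in `io.StringIO`, which behaves like an open file; the name `sample` and the values are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """mpg,cylinders,car name
18.0,8,car a
24.0,4,car b
22.0,6,car c
"""

# read_csv accepts a path or any open file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # the column names come from the first line
```

The first line of the text is treated as the header, so the three remaining lines become the rows.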

**Let's see what the _df1_ `DataFrame` looks like ...**


The _head()_ method displays the first 5 rows of data, which gives an overview of what kind of data each column holds.

In [None]:
# This is a comment after the '#'

df1.head()   # the head method/function/command on the DataFrame variable previews the first 5 rows

# Optionally df1.head(10) would show 10 rows of data

In [None]:
df1.tail()  # the tail method/function/command on the DataFrame variable previews the last 5 rows
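
Both methods accept an optional row count. A minimal sketch, assuming a made-up 20-row table named `demo`:

```python
import pandas as pd

# 20 made-up rows numbered 1 through 20
demo = pd.DataFrame({"n": range(1, 21)})

first_three = demo.head(3)   # rows with n = 1, 2, 3
last_two = demo.tail(2)      # rows with n = 19, 20

print(first_three)
print(last_two)
print(demo.shape)            # (rows, columns) of the full table
```

Checking `shape` alongside `head()` and `tail()` is a quick way to confirm how much data was actually loaded.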

#### Method 2

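Read the data using **pd.DataFrame.from_csv()**. Note that this method was deprecated in pandas 0.21 and removed in pandas 1.0, so the cells below only run on older versions of pandas.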


**Syntax:** 

`pandas.DataFrame.from_csv(path, header=0, sep=',', index_col=0, parse_dates=True, encoding=None, tupleize_cols=False, infer_datetime_format=False)`

In [None]:
df2=pd.DataFrame.from_csv('../../../datasets/auto-mpg/auto-mpg.csv', index_col=None)

In [None]:
df2.head()

It is important to note that **pd.DataFrame.from_csv()** differs from **pandas.read_csv()** in some of its defaults:

- `from_csv()` defaults to `index_col=0`, so we have to specify `index_col=None` to ensure the first column is treated as data and not a row index.
- `from_csv()` defaults to `parse_dates=True`, so it attempts to parse the index as dates, whereas `read_csv()` does not parse dates unless asked to.
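
The effect of these two defaults can be seen with `read_csv()` alone. The sketch below uses made-up in-memory data with a date column; passing `index_col=0, parse_dates=True` to `read_csv()` mimics the `from_csv()` defaults described above.

```python
import io
import pandas as pd

# Made-up CSV text whose first column holds dates
csv_text = """date,value
2020-01-01,10
2020-01-02,20
"""

# read_csv defaults: first column stays a regular data column
default_df = pd.read_csv(io.StringIO(csv_text))

# from_csv-style defaults: first column becomes a date-parsed row index
legacy_like = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(default_df.columns.tolist())
print(legacy_like.columns.tolist())
print(type(legacy_like.index).__name__)
```

In the second DataFrame, `date` no longer appears among the columns because it has become the row index.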


----

### Using `read_csv()` for non-CSV files

`pandas.read_csv()` can also be used for reading data from other types of text files into a pandas DataFrame.
In this example, we are loading a file that is _white space separated_ instead of _comma separated_.
_White space_ includes space characters, tab characters, and a few other special characters that do not normally render as text. 

In [None]:
# Here we add the additional sep function argument

text_df = pd.read_csv("/dsa/data/all_datasets/auto-mpg/auto-mpg.txt", sep=r'\s+')  # the regular expression \s+ sets the delimiter to one or more white space characters

In [None]:
text_df.head()
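
To see how `\s+` treats mixed spacing, here is a minimal sketch on made-up text in which the fields are separated by varying runs of spaces and tabs. The name `ws_df` and the values are hypothetical.

```python
import io
import pandas as pd

# Made-up text: fields separated by one or more spaces or tabs
text = "mpg  cylinders\tweight\n18.0   8\t3504\n24.0 4    2372\n"

# r'\s+' is a regular expression matching any run of white space
ws_df = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(ws_df)
```

Any run of spaces or tabs counts as a single delimiter, so uneven spacing between fields does not create empty columns.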

----

### Subsetting Data

Once data is in a pandas DataFrame, we can filter the data based on column headings and values within rows.


In [None]:
import pandas as pd

with open('/dsa/data/all_datasets/auto-mpg/auto-mpg.csv', 'r') as file:
    cars = pd.read_csv(file)
cars.head()

#### Selecting Columns

Selecting columns is accomplished using the `[]` brackets.
 1. Selecting one column is accomplished by listing the desired column name, e.g., `['car name']`
 1. Selecting more than one column is done by passing a list of columns into the `[]` brackets.
    * For example:
```
cars[
        ['cylinders','car name']
    ]
```

In [None]:
names = cars['car name']
names.head()

In [None]:
cylinders_and_name = cars[['cylinders','car name']]
cylinders_and_name.head()
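
One detail worth noticing is the type of the result. In the sketch below, which uses a small made-up table, a single name inside `[]` returns a `Series`, while a list of names returns a `DataFrame`, even when the list has only one entry.

```python
import pandas as pd

# Small made-up table with the same style of column names
df = pd.DataFrame({
    "cylinders": [8, 4],
    "car name": ["car a", "car b"],
    "mpg": [18.0, 24.0],
})

one = df["car name"]                 # single name -> Series
one_df = df[["car name"]]            # list with one name -> DataFrame
two = df[["cylinders", "car name"]]  # list of names -> DataFrame

print(type(one).__name__)
print(type(one_df).__name__)
print(two.columns.tolist())
```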

#### Selecting Rows

Recall the selection of rows from the NumPy lesson.
Row selection is similar in Pandas, where you test a value within a column.
The test generates a Series of `True` and `False` values, one per row, and only the rows marked `True` are selected.

In [None]:
# Test for cylinders column equal to 5
#  Note '==' is used for comparison because a single '=' is used for assignment

five_cylinders = cars[ cars['cylinders']==5 ]
five_cylinders.head()
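
It can help to look at the intermediate `True`/`False` values on their own. The sketch below uses a small made-up table and also shows one way to combine two tests with `&`, which is an addition not covered in the cells above; each test needs its own parentheses.

```python
import pandas as pd

# Small made-up table; values are illustrative only
df = pd.DataFrame({
    "cylinders": [8, 4, 4, 6],
    "mpg": [18.0, 24.0, 30.0, 22.0],
})

# The test alone produces one True/False value per row
mask = df["cylinders"] == 4
print(mask.tolist())

# Passing the mask inside [] keeps only the rows marked True
four_cyl = df[mask]

# Two tests combined with & (element-wise "and")
four_cyl_efficient = df[(df["cylinders"] == 4) & (df["mpg"] > 25)]

print(four_cyl)
print(four_cyl_efficient)
```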

#### Combining Row and Column selection

A DataFrame can be filtered using both a row value filter and column selection:
  1. First filter the rows using a column test against the original DataFrame
  1. Next specify the desired subset of columns
  

In [None]:
small_five_cyl = cars[ cars['cylinders']==5 ][['mpg','cylinders','displacement','car name']]
small_five_cyl.head()

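Alternatively, the order of these operations can be swapped: select the columns first, then filter the rows.
The row test still refers to the original `cars` DataFrame; pandas matches its `True`/`False` values to rows by index.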

In [None]:
small_five_cyl_2 = cars[['mpg','cylinders','displacement','car name']][ cars['cylinders']==5 ]
small_five_cyl_2.head()
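
As an alternative not used in the cells above, pandas also provides the `.loc` indexer, which takes the row test and the column list in a single step. A minimal sketch on a made-up table, compared against the chained form:

```python
import pandas as pd

# Small made-up table; values are illustrative only
df = pd.DataFrame({
    "mpg": [20.3, 30.0, 25.4],
    "cylinders": [5, 4, 5],
    "car name": ["car a", "car b", "car c"],
})

# Chained form: filter rows, then select columns
chained = df[df["cylinders"] == 5][["mpg", "car name"]]

# .loc form: rows and columns given together as [row test, column list]
with_loc = df.loc[df["cylinders"] == 5, ["mpg", "car name"]]

print(with_loc)
print(chained.equals(with_loc))
```

Both forms select the same data here; `.loc` is often preferred when the result will later be assigned to, because it avoids working on an intermediate copy.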

### Column Operations

Under the hood `Pandas` uses `NumPy` for storage of data.
As such, we are able to do the same mathematical operations on columns as we did on `NumPy` arrays.

In [None]:
import pandas as pd

with open('/dsa/data/all_datasets/auto-mpg/auto-mpg.csv', 'r') as file:
    cars = pd.read_csv(file)
cars.head()

Both `Pandas` and `NumPy` allow you to convert the type of a column to another type.
The `dtype:` describes the data type.
In the case of the `weight` column, the data is a 64-bit integer (whole number).
```
dtype: int64
```

In [None]:
# Look at the type of the weight column

cars['weight'].head()

In [None]:
# Re-assign the whole weight column to a new column of the desired data type

cars['weight'] = cars['weight'].astype(float)
cars.head()

In [None]:
# Look at the new type of the weight column

cars['weight'].head()

We can easily perform basic math on the columns, such as converting the weight column from pounds to tons.

In [None]:
cars['weight'] = cars['weight'] / 2000.0
cars.head()
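
Note that the cell above overwrites the `weight` column, so running it twice divides by 2000 twice. One possible way to avoid that, sketched here on a made-up table, is to store the result in a new column; the name `weight_tons` is hypothetical.

```python
import pandas as pd

# Made-up weights in pounds
df = pd.DataFrame({"weight": [3504, 2372]})

# Keep the original column and write the converted values to a new one
df["weight_tons"] = df["weight"].astype(float) / 2000.0

print(df)
print(df.dtypes)
```

Re-running this cell gives the same result each time because the source column is never modified.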

### Adding and Removing Columns

Columns can be added and removed as well.

Columns are added by simply specifying a new column name and setting it to a series of values of the appropriate length.

Columns are removed by using the Python `del` statement and the column reference.

In [None]:
# range(start, end) produces the sequence of values from 'start' through 'end'-1
# len(cars) gives the length of the cars DataFrame, i.e., the number of rows
# Then add one so the last row number is included

cars['rowNumber'] = range(1,len(cars)+1)
cars.head()

In [None]:
del cars['origin']
cars.head()
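
Besides `del`, pandas offers the `drop()` method, which returns a new DataFrame and leaves the original untouched. This is an alternative to the approach above, sketched on a small made-up table:

```python
import pandas as pd

# Small made-up table; values are illustrative only
df = pd.DataFrame({"mpg": [18.0, 24.0], "origin": [1, 3]})

# Add a column of row numbers starting at 1
df["rowNumber"] = range(1, len(df) + 1)

# drop() returns a copy without the listed columns
trimmed = df.drop(columns=["origin"])

print(trimmed.columns.tolist())
print(df.columns.tolist())   # the original still has 'origin'
```

Using `drop()` is convenient when the original DataFrame is still needed elsewhere in the notebook.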

This has been just a brief introduction to pandas DataFrames.
You will see them used continuously throughout the course for loading data and then feeding it into the statistical and visualization functionality of Python libraries.

# SAVE YOUR NOTEBOOK!!