# Loading Data into Pandas

In the previous lesson you encoded your data by hand in Python data structures. That approach does not work when you have larger amounts of data. This lesson shows the basic functions for loading tabular data that has been saved as a Comma Separated Values (CSV) file into a Pandas DataFrame. It also demonstrates some basic exploration of the data values and shows how to read metadata about the DataFrame and the types of its values. Pandas functions for loading data files are very robust; this lesson only shows the basics necessary for getting started with loading data.

## Learning Objectives

By the end of this lesson, you will be familiar with

- Opening CSV data files and loading them into a Pandas DataFrame with the `read_csv()` function.
- Displaying the partial contents of a DataFrame with the `head()` and `tail()` methods.
- Inspecting a DataFrame to see the number of columns and rows as well as their data types.
- Understanding what type of data the `object` data type represents.
- Interpreting the results of the `info()` method.
- Saving a DataFrame to disk in an Excel format.


TODO loading data directly from the web. 

## Data used in this lesson

- [Pittsburgh Neighborhood Data](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf) - This dataset was extracted from a 2010 Pittsburgh Neighborhood Profiles PDF document. This dataset includes rankings of each neighborhood across a variety of metrics including population, age, race, and housing. The dataset represents information from 2010. Each row is a neighborhood and there are 90 entries.
    - The dataset is stored as a CSV file with the name `pgh-neighborhood-data-2010.csv` in the `datasets` directory.

The data are formatted as a CSV file. You will be using Pandas to read this CSV file into a DataFrame and then write it back to disk as an Excel spreadsheet.


In [None]:
# load pandas
import pandas as pd

## Reading Data into a Dataframe

In the previous lesson we had to type all of our data into the code and create Pandas data structures. This is laborious and takes a lot of time. When we have large datasets it doesn't make sense to type them out or even copy and paste them. Fortunately, Pandas provides a function, `read_csv`, that makes reading data files into a pandas dataframe very easy!

The `read_csv` function is a top-level Pandas function, so it is called via the alias `pd`. It has many, many parameters; refer to [the Pandas `read_csv` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for more information. In the example below, the CSV file `my_data.csv` in the `data_pile` directory is loaded into a DataFrame called `example_dataframe`.

```python
# read a data file called "my_data.csv" into a dataframe called example_dataframe
example_dataframe = pd.read_csv("data_pile/my_data.csv")
```
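`read_csv` accepts a file-like object as well as a file path, which makes it easy to try out its parameters without a file on disk. The sketch below is only an illustration: the CSV text, the column names, and the variable names are made up for this example and are not part of the lesson's dataset. It uses two of the optional parameters, `usecols` and `nrows`.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = """Neighborhood,Population,Median Age
Alpha,1200,34.5
Beta,850,41.2
Gamma,2300,29.8
"""

# read_csv accepts a file-like object as well as a file path
sample_dataframe = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the named columns; nrows limits how many rows are read
partial_dataframe = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Neighborhood", "Population"],
    nrows=2,
)

print(sample_dataframe)
print(partial_dataframe)
```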

#### Task - Loading Data into Pandas from Disk

- Copy the example above, but open a data file called "pgh-neighborhood-data.csv" in the "datasets" directory and give it a variable name of "pgh_neighborhood_data". Remember, to 

In [None]:
# Your code here


#### Answer - Loading Data into Pandas from Disk

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

pgh_neighborhood_data = pd.read_csv("datasets/pgh-neighborhood-data.csv")

## Displaying DataFrames

Once you load a CSV file into a DataFrame, you can begin to manipulate it. The first step to working with data in Pandas is looking at it to see if it loaded correctly.

#### Task - Displaying a Dataframe

1. Put the name of the variable containing the dataframe in the code cell below.
2. Look at the output, consider what portions of the dataframe are being displayed in the output.

In [None]:
# your code here


#### Answer - Displaying a Dataframe

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
pgh_neighborhood_data

## Heads and Tails

By default, pandas will display the first and last five rows of a large dataframe. However, sometimes you may want to see a specific number of rows from the beginning (or end) of the dataframe. Pandas dataframes have the `head(n)` function, which takes an argument, `n`, that indicates how many rows to display. If you leave `n` out, it defaults to 5.

```python
# display the first three rows
my_dataframe.head(3)
```
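As a minimal sketch of how the argument behaves, the example below uses a small made-up dataframe (the `value` column and the variable names are hypothetical) rather than the lesson's dataset. Note that `head()` returns a new dataframe, so you can store the result or check its length.

```python
import pandas as pd

# Hypothetical data: ten rows with one made-up column
my_dataframe = pd.DataFrame({"value": range(10)})

first_five = my_dataframe.head()     # n defaults to 5
first_three = my_dataframe.head(3)   # the first three rows
everything = my_dataframe.head(100)  # only ten rows exist, so all ten come back

print(len(first_five), len(first_three), len(everything))
```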

#### Task - Displaying Specific Rows of a Dataframe
1. Use the `head()` function to display the first 8 rows of the Pittsburgh Neighborhood dataset.
2. Try a much larger number and see what happens.

In [None]:
# your code here


#### Answer - Displaying Specific Rows of a Dataframe

Click on the ellipses (...) below to see the answers.

In [None]:
# answer task 1
pgh_neighborhood_data.head(8)

In [None]:
# answer task 2
pgh_neighborhood_data.head(80)

If you picked a large enough number, you may have noticed Pandas defaulted back to the first and last 5 rows. There is a limit to how many rows Pandas will display in the output. See if you can find out what that limit is.

#### Task - Finding the Max Number of Rows to be Displayed

1. Try different numbers in the `head` function to find the largest number you can use as an argument before Pandas defaults to the first and last 5 rows.
2. BONUS: Do some googling to see if you can figure out how to change that limit. Set the number to 90 so every neighborhood in Pittsburgh can be displayed.
3. BONUS: Reset the number back to the default value

In [None]:
# your code here


#### Answer - Finding the Max Number of Rows to be Displayed

Click on the ellipses (...) below to see the answers.

In [None]:
# answer 1
pgh_neighborhood_data.head(60)

In [None]:
# bonus answer 2
pd.set_option('display.max_rows', 90)
pgh_neighborhood_data

In [None]:
# bonus answer 3
pd.set_option('display.max_rows', 60)
pgh_neighborhood_data
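Typing the default back in by hand works, but pandas can also report the current value and restore the default for you. The sketch below shows one possible approach using `pd.get_option`, `pd.reset_option`, and `pd.option_context`; the variable names are made up for this example.

```python
import pandas as pd

# Ask pandas for the current limit instead of guessing
original_limit = pd.get_option("display.max_rows")

# Raise the limit, then restore the pandas default without hard-coding it
pd.set_option("display.max_rows", 90)
raised_limit = pd.get_option("display.max_rows")
pd.reset_option("display.max_rows")

# option_context changes an option only inside the with block
with pd.option_context("display.max_rows", 90):
    limit_inside_block = pd.get_option("display.max_rows")
limit_after_block = pd.get_option("display.max_rows")

print(original_limit, raised_limit, limit_inside_block, limit_after_block)
```

The `option_context` form is handy when you want to display one large dataframe in full without changing the setting for the rest of the notebook.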

#### Task - Displaying the last rows of a dataframe

Pandas Dataframes have another function for displaying the last `n` rows of a dataframe. 

1. Guess the name of that function and use it to display the last 8 rows of the dataframe.

In [None]:
# your code here


#### Answer - Displaying the last rows of a dataframe

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
pgh_neighborhood_data.tail(8)
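One detail worth noticing is that `tail()` keeps the original row labels instead of renumbering from zero. The short sketch below illustrates this with a made-up dataframe; the `value` column and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: ten rows with one made-up column
my_dataframe = pd.DataFrame({"value": range(10)})

last_three = my_dataframe.tail(3)

# The row labels are carried over from the original dataframe
print(list(last_three.index))
```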

## How Many Rows and Columns

There are several ways to see how many rows and columns are available in a dataframe. If you let Jupyter display your dataframe, it will automatically display the number of rows and columns at the bottom of the output (see example below).

In [None]:
# display the dataframe
pgh_neighborhood_data

However, if you need to *programmatically* access information about the size of the dataframe, you have a couple of options. You could use the built-in Python function `len()`; however, this will only give you the number of rows.

In [None]:
len(pgh_neighborhood_data)

The `shape` *property* of Pandas dataframes provides an easy way to programmatically access both the number of rows and columns of a dataframe.

#### Task - Checking the shape

1. Use the `shape` property to check and see how many rows and columns are in `pgh_neighborhood_data`.
    - Remember properties are not callable like functions, they are like variables with data.
2. What is the data type of the result?
3. Can you select just the number of rows using Python indexing?

In [None]:
# Your code here


#### Answer - Checking the shape

Click on the ellipses (...) below to see the answers.

In [None]:
# Answer 1
pgh_neighborhood_data.shape

In [None]:
# Answer 2
type(pgh_neighborhood_data.shape)

In [None]:
# Answer 3
# shape is (rows, columns), so the number of rows is at index 0
pgh_neighborhood_data.shape[0]
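Because `shape` is a tuple of (rows, columns), it can also be unpacked into two variables. The sketch below uses a small made-up dataframe (the column names and variable names are hypothetical) to show how `shape`, `len()`, and the `columns` property relate to each other.

```python
import pandas as pd

# Hypothetical data: three rows and two made-up columns
my_dataframe = pd.DataFrame(
    {"name": ["Alpha", "Beta", "Gamma"], "population": [1200, 850, 2300]}
)

# shape is a tuple of (rows, columns), so it can be unpacked
n_rows, n_columns = my_dataframe.shape

print(n_rows, n_columns)
print(len(my_dataframe) == n_rows)
print(len(my_dataframe.columns) == n_columns)
```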

## Inspecting Data Types


When you read data into a DataFrame, each column of data is assigned a specific data type. Pandas will inspect the data to make an educated guess as to the data type of the values from the CSV file. You can use the `dtypes` property to inspect the type of data Pandas guessed.

Here is the syntax for using the `dtypes` property:
```python
<dataframe_variable>.dtypes
``` 

Note, `dtypes` is not a function, it is a *property*. This means you aren't executing something; you are displaying its value, so no parentheses are required.

#### Task - Inspecting Data Types

1. Use the `dtypes` property to get a list of the data types for each column
2. Look at the results. Do they make sense given the portions of the dataframe you inspected above?
3. Is there a column data type that doesn't make sense?

In [None]:
# your code here



#### Answer - Inspecting Data Types

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
pgh_neighborhood_data.dtypes
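If you want to check a column's type in code rather than by eye, the helper functions in `pd.api.types` are one option. The sketch below is only an illustration: the CSV text, the column names, and the variable names are made up and are not part of the lesson's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content with a text, an integer, and a decimal column
csv_text = """name,population,median_age
Alpha,1200,34.5
Beta,850,41.2
"""

my_dataframe = pd.read_csv(io.StringIO(csv_text))

# One entry per column; the label shown for text can depend on the pandas version
print(my_dataframe.dtypes)

# The helper functions in pd.api.types test which family a column's type belongs to
print(pd.api.types.is_integer_dtype(my_dataframe["population"]))
print(pd.api.types.is_float_dtype(my_dataframe["median_age"]))
print(pd.api.types.is_numeric_dtype(my_dataframe["name"]))
```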

## The `object` type

Pandas is pretty good at identifying and properly assigning numeric data types. When working with textual or categorical data, pandas currently defaults to `object`, which means each value in the column is stored as a generic Python object rather than as one of the specialized numeric types.


#### Task - What's in an Object?

1. Compare the data types from the output of `dtypes` with the display of the dataframe using `head`. Do the types assigned by pandas make sense given your understanding of the data? 
2. Look at the data type for `Neighborhood`. What Python data type do you think Pandas is using?

In [None]:
# your code here


#### Answer - What's in an Object? 

Click on the ellipses (...) below to see the answers.

*answer*

Pandas assigned the `Neighborhood` column the `object` data type. The values in that column are Python strings (`str`). The `object` type can hold any Python object, but for a column of text read from a CSV file it holds strings. There is a dedicated `string` data type in Pandas, but it is not assigned by default. You can [read the user guide about text data types](https://pandas.pydata.org/docs/user_guide/text.html#text-data-types) for more information about representing textual data. Pandas also has a [categorical data type](https://pandas.pydata.org/docs/user_guide/categorical.html), but it is outside the scope of Data Basics.
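You can confirm what an `object` column holds by checking the Python type of its individual values. The sketch below builds its columns by hand with `dtype=object` so the behavior is explicit; the data and variable names are made up for this example.

```python
import pandas as pd

# Hypothetical column that is explicitly stored with the object dtype
names = pd.Series(["Alpha", "Beta", "Gamma"], dtype=object)

# Each element is an ordinary Python string
print(type(names.iloc[0]))

# object can hold any Python value, so mixed columns also end up as object
mixed = pd.Series(["Alpha", 42, 3.14], dtype=object)
print([type(value).__name__ for value in mixed])

# Converting to the dedicated pandas string type is an explicit step
names_as_string = names.astype("string")
print(names_as_string.dtype)
```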

### Getting more info 

Pandas dataframes have a function, `info()`, that provides useful meta-information about the dataframe:
- Column index position and names
- Number of non-null values in each column
- Data type of each column and how many columns there are of each type
- The size and type of the row index
- The amount of memory used

#### Task - Checking Memory Usage

1. Use the `info()` function to see how much *memory* is being *used* by `pgh_neighborhood_data`.

In [None]:
# Your code here


#### Answer - Checking Memory Usage

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
pgh_neighborhood_data.info()

The `pgh_neighborhood_data` dataframe is using around 10 kilobytes of memory.
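By default, `info()` estimates memory without inspecting the contents of `object` columns, so the figure it reports can be a lower bound. The sketch below shows one way to ask for a fuller measurement and to capture the report as text. The dataframe, column names, and variable names are made up for this example.

```python
import io

import pandas as pd

# Hypothetical data: one text column and one numeric column
my_dataframe = pd.DataFrame(
    {"name": ["Alpha", "Beta", "Gamma"], "population": [1200, 850, 2300]}
)

# info() prints its report; passing a buffer captures the text instead
buffer = io.StringIO()
my_dataframe.info(buf=buffer, memory_usage="deep")
report = buffer.getvalue()
print(report)

# memory_usage() returns the bytes used by the index and by each column
bytes_per_column = my_dataframe.memory_usage(deep=True)
print(bytes_per_column.sum())
```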

## Writing Dataframes to disk

Pandas provides a variety of mechanisms for saving your data as a data file. In the cell below, place your cursor right after the `to_` and hit tab to show the autocomplete suggestions.

In [None]:
pgh_neighborhood_data.to_

Don't worry if you don't understand all of the available functions at this time. 

- To save to a CSV file, use `to_csv()`
- To save to a JSON file, use `to_json()`
- To save to an Excel file, use `to_excel()`

The first argument when calling one of these functions will be the filename to write to disk. So if we wanted to create a second copy of our neighborhood dataset as a CSV file, we could write the following code.

```python
pgh_neighborhood_data.to_csv("pgh-neighborhood-data-2.csv")
```
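One thing to be aware of is that `to_csv()` also writes the row index as an extra first column unless you pass `index=False`. The sketch below writes to in-memory buffers instead of files so the difference is easy to see. The data and variable names are made up for this example.

```python
import io

import pandas as pd

# Hypothetical data with two made-up columns
my_dataframe = pd.DataFrame(
    {"name": ["Alpha", "Beta"], "population": [1200, 850]}
)

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
my_dataframe.to_csv(with_index)
print(with_index.getvalue())

# index=False writes only the data columns
without_index = io.StringIO()
my_dataframe.to_csv(without_index, index=False)
print(without_index.getvalue())
```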

#### Task - Save data as an Excel file

1. Use the `to_excel()` function to save `pgh_neighborhood_data` as an Excel file called "pgh-neighborhoods-data.xlsx". Set the `index` keyword argument to `False` to prevent the number index from being saved. You can also set the `sheet_name` keyword argument to a string of your choice to name the Excel sheet.

In [None]:
# your code here


#### Answer - Saving data as an Excel File

Click on the ellipses (...) below to see the answers.

In [None]:
# Answer 1

pgh_neighborhood_data.to_excel("pgh-neighborhoods-data.xlsx", index=False)

#### Task - Save the data as a tab separated values file

Pandas doesn't appear to have a function for saving data as a tab separated values file. 

1. Review the [documentation for the `to_csv()` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv). What parameter would you use to specify tab separated instead of comma separated?
2. Write the `pgh_neighborhood_data` to a tab separated values file with a `.tsv` file extension. Remember, you can represent tab using the `\t` escape character.


In [None]:
# your code here

#### Answer - Save the data as a tab separated values file

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
pgh_neighborhood_data.to_csv("pgh-neighborhoods.tsv", sep="\t")
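To check that a tab separated file was written correctly, you can read it back with `read_csv()` and the same `sep` value. The sketch below is a minimal round trip under a few assumptions: the dataframe, the file name `example.tsv`, and the variable names are made up, and the file is written to a temporary directory so nothing real is overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with two made-up columns
my_dataframe = pd.DataFrame(
    {"name": ["Alpha", "Beta"], "population": [1200, 850]}
)

# Write to a temporary directory so no real files are overwritten
with tempfile.TemporaryDirectory() as temp_dir:
    tsv_path = os.path.join(temp_dir, "example.tsv")
    my_dataframe.to_csv(tsv_path, sep="\t", index=False)

    # The same sep value tells read_csv how to split the columns
    restored = pd.read_csv(tsv_path, sep="\t")

print(restored)
```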