<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module8_pandas1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://cdn.pixabay.com/photo/2019/09/08/19/54/panda-4461766_1280.jpg' width=700>  
Photo by qgadrian production from Pixabay

# APEX Faculty Training, Module 8: Pandas Part 1

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: Feb 13, 2022  

**Learning outcomes**  
1. Learn to read spreadsheet data into Python as "dataframes" using the Pandas library.
2. Learn to manipulate the contents of a dataframe with Pandas methods.

## 1. A couple of notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`
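
As a quick illustration of the style preferences above, here is a minimal sketch. The variable names `total_pop`, `hisp_pop`, and `other_pop` are made up for this example; the two numbers are the Alabama values from the example table in Section 3.

```python
# Variable names use snake_case
total_pop = 4779736
hisp_pop = 185602

# Spaces before and after the operator
other_pop = total_pop - hisp_pop

# No space between the function name and the parentheses,
# and no extra spaces just inside the parentheses
print(other_pop)
```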
        


## 2. What is (are?) Pandas?  
Pandas is a Python library that provides tools for working with rows and columns of data. In other words, it's useful for working with data in spreadsheet form. In Pandas, however, we use the term "dataframe" rather than spreadsheet.  

Pandas is used to organize, clean, and view data, similar to what you might do in Excel or Sheets. This library is useful because it:
* Can easily work with large data sets
* Lets you write code to organize and clean your data, which is much easier than doing these tedious tasks by hand

## 3. Pandas Dataframes
A dataframe is a data type provided by Pandas, much as integers, floats, strings, and lists are data types built into Python. Dataframes have special properties and methods that can be applied to them. As you can see in the example below, a dataframe looks much like a spreadsheet.  

<table>
<tr>
<th></th>
<th>state</th>
<th>totalPop</th>
<th>hispPop</th>
</tr>
<tr>
<td>0</td>
<td>Alabama</td>
<td>4779736</td>
<td>185602</td>
</tr>
<tr>
<td>1</td>
<td>Alaska</td>
<td>710231</td>
<td>39249</td>
</tr>
<tr>
<td>2</td>
<td>Arizona</td>
<td>6392017</td>
<td>1895149</td>
</tr>
<tr>
<td>3</td>
<td>Arkansas</td>
<td>2915918</td>
<td>186050</td>
</tr>
<tr>
<td>4</td>
<td>California</td>
<td>37253956</td>
<td>14013719</td>
</tr>
</table>
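
To make the idea concrete, the table above could be built by hand roughly as follows. This is only a sketch for illustration: in this module, dataframes come from reading in a file, and the variable name `example_df` is made up here. The `import` line is explained in Section 3b.

```python
import pandas as pd

# Values copied from the example table above;
# each dictionary key becomes a column name
example_df = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "totalPop": [4779736, 710231, 6392017, 2915918, 37253956],
    "hispPop": [185602, 39249, 1895149, 186050, 14013719],
})

# Printing the dataframe shows the rows and columns,
# with the row numbers 0 through 4 on the left
print(example_df)
```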

### 3a. Creating Dataframes  
Typically, dataframes are created by reading in a file. The most common file type used with Pandas is a CSV file, e.g., `my_file.csv`. Files created in Excel and Sheets can be exported as CSV files, so if you have an existing dataset, you can save it as a CSV file and read it in with Pandas.  

### 3b. Using Pandas  
A Python library, such as Pandas, has to be imported before its functions can be used. It is best to include this import code at the top of a given notebook. 

To import Pandas, use the keyword `import` followed by the library name, `pandas`. It is also standard to use an abbreviation when importing libraries; for Pandas, this abbreviation is `pd`. Putting this all together, we can import Pandas with the following line of code:

`import pandas as pd`  

<font color='red'>Exercise 1</font>  
Copy and paste the code above into the cell below, making sure to run the cell. You won't see any output, but this step is necessary to use Pandas functions in subsequent exercises.

## 4. Reading in Files
How does one open a CSV file in Python? In Excel or Sheets, you would normally click the "open" icon and select the file of interest. The process is a bit different with Python, but generally involves the same steps: telling Python that you want to open (or "read") a file, and then specifying which file it is that you want to read.

Pandas has a function `read_csv()` that reads in CSV files and converts the content to a dataframe. The generic syntax looks like this:

`my_df = pd.read_csv(filepath)`  

Breaking down this syntax:
* `my_df` is a generic variable name that represents the dataframe that you're creating. You can, of course, choose any variable name that you like.
* `pd` is the abbreviation for Pandas; essentially, we're saying, "I want to use a function within the Pandas library"
* `read_csv()` is the specific function within Pandas that we'll use to read in the CSV file and create the dataframe
* Finally, `filepath` is the location and name of the CSV file in question
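
If you'd like to see `read_csv()` in action without needing a file, here is a minimal sketch. It assumes a few lines of CSV text typed directly into the cell (the names `csv_text` and `sample_df` are made up for this example), and uses Python's built-in `io.StringIO` so that the text can stand in for a file. With a real file, you would pass the file's location instead, as in the generic syntax above.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file:
# the first line holds the column names, and each later line is one row
csv_text = """state,totalPop,hispPop
Alabama,4779736,185602
Alaska,710231,39249
Arizona,6392017,1895149
"""

# io.StringIO lets read_csv() treat the string like an open file
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
```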

For the purposes of this tutorial, we'll be using a CSV file stored on GitHub, which can be accessed via a URL. Below, you'll see that we created the variable `filepath` and assigned it a string with the desired URL.

In a future module, we'll teach you how to create your own GitHub account so that you can drop files there and obtain URLs, and we'll also teach you an alternate approach in which you can instead add files to Google Drive. The latter involves a couple of extra steps but is nonetheless quite doable.

<font color='red'>Exercise 2</font>  
On line 1 below, we've defined the filepath for you. Insert a new line in the cell, and use the code provided above to create a dataframe named `my_df` that uses this filepath. Be sure to run the cell when you're done!

In [None]:
filepath = "https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/state_pop.csv"

## 5. Viewing Dataframes  
### 5a. `head()` and `tail()`
Now that we've created the dataframe, let's check to make sure it looks as we expect it to – data arranged in rows and columns. Pandas has a helpful method `head()` that displays the header (i.e., column names) and first few rows of a dataframe. The syntax for using this method is as follows:

`my_df.head()`  

<font color='red'>Exercise 3</font>  
Copy and paste the code from above into the cell below, and run the cell. You should now see the first few rows of the dataframe you created in Exercise 2.

In [None]:
my_df.head()

By default, the `head()` method displays the first 5 rows of data. Providing a number in the parentheses will instead show a specific number of entries. For instance: `my_df.head(6)` would display 6 rows, rather than the default of 5 rows.  

Similarly, you can use `tail()` to display the last few rows of a dataframe. This method also defaults to 5 rows, and you can specify a different number as desired. The syntax for this method is:

`my_df.tail()`
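
Here is a small sketch of both methods with a number in the parentheses. It uses a made-up 10-row dataframe named `demo_df` (not the tutorial's dataset) so that you can see exactly which rows come back.

```python
import pandas as pd

# Made-up dataframe with 10 rows, just for illustration
demo_df = pd.DataFrame({"n": range(1, 11)})

first_two = demo_df.head(2)  # the first 2 rows
last_four = demo_df.tail(4)  # the last 4 rows

print(first_two)
print(last_four)
```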

<font color='red'>Exercise 4</font>  
In the cell below, display the first 7 rows of `my_df`.

<font color='red'>Exercise 5</font>  
Next, display the last 3 rows of the dataframe.

If you would like to see the *entire* dataframe, you can simply type its name, in this case, `my_df`. This works fine for small dataframes like the one we're using in this tutorial, but it's typically not helpful to spit out a huge dataframe in your notebook. So, you may not use this command very often!

### 5b. Single Column  
You can display a single column of a dataframe by using either the Attribute approach or the Label approach.

The syntax for the **Attribute approach** is as follows, with `col_name` being a generic stand-in for the column in which you're interested:

`my_df.col_name`   

The syntax for the **Label approach** is as follows. Importantly, note that when using this approach, you must put the column name in quotes:

`my_df['col_name']`  
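
The sketch below shows both approaches side by side on a made-up two-row dataframe (`demo_df` and the column name `total pop` are invented for this example). It also illustrates one practical difference: a column name that contains a space can only be reached with the Label approach, because `demo_df.total pop` is not valid Python.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "total pop": [4779736, 710231],  # made-up column name containing a space
})

# Both approaches return the same column
by_attribute = demo_df.state
by_label = demo_df["state"]
print(by_attribute.equals(by_label))

# Only the Label approach works when the column name contains a space
print(demo_df["total pop"])
```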

<font color='red'>Exercise 6</font>  
Run the cell below to see the column `totalPop` displayed using the **Attribute** approach.

In [None]:
my_df.totalPop

<font color='red'>Exercise 7</font>  
Now try displaying the column `hispPop` in the cell below using the **Label** approach. Refer to the syntax just above!

## All done!
Congrats on finishing the Pandas module! In the next module, we'll dive deeper into working with dataframes using the Pandas library.