# Python Fundamentals: Introduction to Pandas

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Import libraries in Python.
* Import .csv files into a Pandas `DataFrame`.
* Use common `DataFrame` methods.
* Select rows and columns from a `DataFrame` using conditions.
</div>


### Icons Used In This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Sections
1. [Libraries](#lib)
2. [Data Frames: Spreadsheets in Python](#df)
3. [Demo: Grouping and Plotting Data Frames](#group)

<a id='lib'></a>

## Libraries

A **library** refers to a reusable chunk of code. Usually, a Python library contains a collection of related functionalities.

We have already been using Python's [standard library](https://docs.python.org/3/library/) - it comes ready and loaded with Python. We've also used `pandas` to work with data frames. Today, we will expand on our Pandas knowledge and do our first data science project.

### Installing libraries

The most common option is to install a library directly from the command line. One way to do this is to open the JupyterLab Launcher by clicking on the `+` symbol in the top left of JupyterLab, then select Terminal.

You can then use `pip`, a Python package installer, to install new packages. Simply run `pip install [PACKAGE_NAME]`, and the package will be installed.

💡 **Tip**: You can also install packages within a Jupyter Notebook. Create a new cell, and run the command `!pip install [PACKAGE_NAME]`.
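
Before installing, it can be handy to check from Python whether a package is already available. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` and the nonsense package name are made up for illustration.

```python
from importlib.util import find_spec

def is_installed(package_name):
    """Return True if Python can locate the package, without importing it."""
    return find_spec(package_name) is not None

# "math" ships with Python, so it is always found
print(is_installed("math"))

# A made-up name that should not exist on any machine
print(is_installed("surely_not_a_real_package_123"))
```

If the check fails for a package you need, that is the moment to run `pip install`.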


### Importing libraries 
Before we can use a library like Pandas, we have to **import** it into the current session.
Importing is done with the `import` keyword. We simply run `import [PACKAGE_NAME]`, and everything inside the package becomes available through the package's name.

Let's import the `numpy` module, which has a lot of useful functions for working with numerical data. Let's access a function from this module using dot notation.

In [None]:
import numpy

print('The mean of [1, 4, 5] is:', numpy.mean([1, 4, 5]))

For many packages, like `numpy`, there is an **alias**, or nickname that they are often imported as. For common packages (especially those with long names), it saves a lot of typing when you use a nickname. For example, `numpy` is usually imported as below:

In [None]:
import numpy as np

print('mean of [1, 4, 5] is:', np.mean([1, 4, 5]))

There are very common abbreviations used for some of the more popular libraries, including:

* `pandas` -> `pd`
* `numpy` -> `np`
* `matplotlib.pyplot` -> `plt`
* `statsmodels.api` -> `sm`

⚠️ **Warning**: Sometimes aliases can make programs harder to understand, since readers must learn your program's aliases. Be very intentional about using aliases!
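
An alias is only a second name for the same library, not a copy of it. The sketch below illustrates this with the standard library's `math` module; the alias `m` is a hypothetical choice made just for this example and is not a common convention.

```python
import math
import math as m  # "m" is a made-up alias, used only in this example

# Both names point at the very same module object
print(m is math)

# So functions can be reached through either name
print(math.sqrt(16))
print(m.sqrt(16))
```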

### Help!

How do we know what we can do with a library like `numpy`? Usually, packages provide **documentation** that explains their components. As an example, let's have a look at the documentation for the standard library's `math` module [online](https://docs.python.org/3/library/math.html).

Being comfortable sifting through documentation is a **very** important skill!
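
Documentation can also be explored without leaving Python. As a rough sketch, the built-in `dir()` function lists the names a module provides, and each function carries a short description in its `__doc__` attribute. The variable name `public_names` is just an illustrative choice.

```python
import math

# Public names defined by the module (skipping internal ones that start with "_")
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])

# The short description attached to a single function
print(math.sqrt.__doc__)

# help(math.sqrt) prints the same information in a longer, formatted form
```

In a Jupyter notebook, typing `math.sqrt?` in a cell shows similar help text.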


## 🥊 Challenge 1: Locating the Right Library

You want to select a random value from a list of data.

1. What [standard library](https://docs.python.org/3/library/) would you most expect to help? Look at the documentation and find it.
2. Which **function** would you select from that library? 💡 **Tip**: Look at "Functions for sequences" in the documentation.
3. Import the library, and apply the function to the following list.

In [None]:
years = [1952, 1957, 1962, 1967, 1972, 1977]

In [None]:
# YOUR CODE HERE


<a id='df'></a>

# Data Frames: Spreadsheets in Python

**Tabular data** is everywhere. Think of an Excel sheet: each column corresponds to a different feature of each datapoint, while rows correspond to different samples.

In scientific programming, tabular data is often called a **data frame**. In Python, there is a specialized library called Pandas, which contains an object `DataFrame` that implements this data structure.
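
To make the idea concrete, here is a small sketch of a `DataFrame` built by hand. The three values per column are taken from the dataset used later in this notebook; the variable names `data` and `small_df` are illustrative only.

```python
import pandas as pd

# Three values per column, taken from the Afghanistan rows of the gapminder data
data = {
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
}

# Each dictionary key becomes a column; each list supplies that column's rows
small_df = pd.DataFrame(data)
print(small_df)
print(type(small_df))
```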

## 🥊 Challenge 2: From Dictionary to Data Frame

You can easily build a data frame from a dictionary. However, the following code gives an error. Why does it have an error? 

💡 **Tip:** Google the line at the bottom of the error message if you need help!

In [None]:
import pandas as pd

country = ['Afghanistan', 'Greece']
continent = ['Asia', 'Europe', 'Africa']
life_exp = [28.801, 76.670, 46.027]

# Each key becomes a column name; each list holds that column's values
country_dict = {
    'country': country,
    'continent': continent,
    'life_exp': life_exp}

pd.DataFrame(country_dict)

## Finding Data

For the rest of this workshop we will work with the [gapminder-FiveYearData](https://en.wikipedia.org/wiki/Gapminder_Foundation) dataset. The dataset contains data for 142 countries, with values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.

First of all, we have to figure out where the data is located! 

We can use the magic command `%pwd` to check the location of your "working directory" (the folder on your computer that Python is currently connected to). 

In [None]:
# print working directory
%pwd

## Importing .csv Files

The file we want to import is inside a folder called "data", which is inside the main "Python-Fundamentals" folder. As you can see in the file path, this directory is one folder "up" from where we currently are.

💡 **Tip**: Let's use the File Browser to the left of our screen, as well as our Finder (Mac) / File Explorer (Windows), to orient ourselves. 

In the cell below, we:
* `import` the `pandas` **library**.
* Use the `read_csv()` function, which takes a string as its main argument. This string consists of the file path pointing to the file.
* `../` means 'go up one level in the folder'.
* `data/` means 'go into a folder called "data"'.
* `gapminder-FiveYearData.csv` is the file name we are accessing within that "data" folder.

In [8]:
import pandas as pd

df = pd.read_csv('../data/gapminder-FiveYearData.csv')
df.head()

| | country | year | pop | continent | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
| 1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.85303 |
| 2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.10071 |
| 3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.02 | 836.197138 |
| 4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
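
The relative path passed to `read_csv()` above can be taken apart with the standard library's `pathlib` module. This is only a sketch to help you orient yourself; the variable name `csv_path` is an illustrative choice, and the last two lines depend on where your notebook is running.

```python
from pathlib import Path

csv_path = Path("../data/gapminder-FiveYearData.csv")

print(csv_path.parts)   # the individual steps of the path
print(csv_path.name)    # the file name at the end
print(csv_path.suffix)  # the file extension

# Where the relative path points, starting from the current working directory
print(csv_path.resolve())

# Whether a file actually exists there on this machine
print(csv_path.exists())
```

If `exists()` prints `False`, the working directory is probably not the one you expected, and `read_csv()` would fail with a "file not found" error.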


Objects can also have **attributes**, or variables associated with the data type. We can get the number of rows and columns with `df.shape`, an attribute of the data frame that holds a `(rows, columns)` tuple.

In [None]:
df.shape
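
Because `shape` is a tuple, its two numbers can be unpacked into separate variables. The sketch below uses a small stand-in table with values from the output above; `small_df`, `n_rows`, and `n_cols` are illustrative names.

```python
import pandas as pd

# A stand-in table with 3 rows and 2 columns
small_df = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

n_rows, n_cols = small_df.shape  # shape is a tuple: (rows, columns)
print(n_rows, n_cols)
print(len(small_df))  # len() gives the number of rows only
```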

## More on .csv files
As data scientists, we'll often be working with these **Comma Separated Values (.csv)** files.

Comma separated values files are common because they are relatively small and can be opened by any spreadsheet software. A comma separated values file is just a text file in which commas (or other separators) indicate column breaks and each line holds one row.

As you see, `pandas` comes with a function [`.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
that makes it really easy to import .csv files.

💡 **Tip**: Let's have a look at a .csv file in our File Browser!
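
To see that a .csv file really is just text, the sketch below writes the text out as a Python string and hands it to `read_csv()`. It assumes the standard library's `io.StringIO`, which makes a string behave like a file; the two data rows are copied from the `df.head()` output above, and `csv_text` and `mini_df` are illustrative names.

```python
import io
import pandas as pd

# The raw text of a tiny .csv file: a header line, then one line per row
csv_text = """country,year,lifeExp
Afghanistan,1952,28.801
Afghanistan,1957,30.332
"""

# io.StringIO lets read_csv() treat the string as if it were a file
mini_df = pd.read_csv(io.StringIO(csv_text))
print(mini_df)
print(mini_df.dtypes)
```

Note that `read_csv()` works out on its own that `year` holds whole numbers and `lifeExp` holds decimals.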

## Slicing Columns
Now that we have our data, we can choose a single column by selecting the name of that column. The act of obtaining a particular subset of a data frame is often referred to as **slicing**. This uses bracket notation to select part of the data.

Check it out:

In [None]:
df['country']

`pandas` calls this a `Series` object. It's like a list, except that each value has a label (its index).

You can index or slice a Series object much like you can with a list! Here, `0` is the label of the first row:

In [None]:
gap_country = df['country']
gap_country[0]
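
The difference between labels and positions is easier to see on a tiny example. This is a sketch with a hand-made `Series` standing in for `df['country']`; the name `countries` and the choice of three entries are illustrative.

```python
import pandas as pd

# A stand-in for df['country'], using country names that appear in this notebook
countries = pd.Series(["Afghanistan", "Greece", "Egypt"], name="country")

print(type(countries))
print(countries.index.tolist())  # the labels: 0, 1, 2 by default
print(countries[0])              # look up by label
print(countries.iloc[-1])        # look up by position (last element)
print(countries[0:2].tolist())   # slicing works like a list
```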

`DataFrame` objects also have methods, including those for [merging](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge), [aggregation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), and others. Many methods can also be applied to a single column of the DataFrame. For example, we can count the unique values in the `country` column by using `.nunique()`:

In [None]:
print(df['country'].nunique())

## More Methods: `.head()`, `.describe()`, and `.value_counts()`

The `.head()` method will show the first five rows of a Data Frame by default. Put an integer in the parentheses to specify a different number of rows. 

`.describe()` provides basic summary statistics. 

`.value_counts()` counts frequencies.

In [None]:
# View the first 3 rows
df.head(3)

In [5]:
# Produce some quick summary statistics
df.describe()

| | year | pop | lifeExp | gdpPercap |
|---|---|---|---|---|
| count | 1704.0 | 1704.0 | 1704.0 | 1602.0 |
| mean | 1979.5 | 29601210.0 | 59.474439 | 7263.152766 |
| std | 17.26533 | 106157900.0 | 12.917107 | 9952.559288 |
| min | 1952.0 | 60011.0 | 23.599 | 277.551859 |
| 25% | 1965.75 | 2793664.0 | 48.198 | 1203.161887 |
| 50% | 1979.5 | 7023596.0 | 60.7125 | 3550.676623 |
| 75% | 1993.25 | 19585220.0 | 70.8455 | 9397.077688 |
| max | 2007.0 | 1318683000.0 | 82.603 | 113523.1329 |


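Now we can count how often each value occurs in a column. Note that `gdpPercap` is a continuous measurement, so every value occurs only once here; `.value_counts()` is most useful on columns with repeated values, such as categories.

In [6]:
# How many times does each gdpPercap value occur?
df['gdpPercap'].value_counts()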

779.445314     1
601.074501     1
1619.848217    1
1385.029563    1
1576.973750    1
              ..
752.749726     1
660.585600     1
653.730170     1
665.624413     1
469.709298     1
Name: gdpPercap, Length: 1602, dtype: int64
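
For comparison, here is a sketch of `.value_counts()` on a column where values do repeat. The `years` Series is hypothetical and exists only to show the shape of the result.

```python
import pandas as pd

# Hypothetical column in which some years repeat
years = pd.Series([1952, 1957, 1952, 1962, 1952], name="year")

counts = years.value_counts()  # most frequent value first
print(counts)
print(years.nunique())  # number of distinct values
```

The result is itself a `Series`: the distinct values become the labels, and the counts become the values.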

## 🥊 Challenge 3: Putting Methods in Order

In the following code we want to find the most frequently occurring continent in the data. Put the following code fragments in the right order!

In [None]:
.head(1)
.value_counts()
df['continent']

## Column names

You can access [attributes](https://medium.com/@shawnnkoski/pandas-attributes-867a169e6d9b) of a Pandas variable by using "dot notation" - it's like a method, but without the parentheses.

💡 **Tip**: Attributes are **features** of data. Methods **allow you to do something** with data. 

💡 **Tip**: A method is written with parentheses: e.g. `df.describe()`. An attribute is written without parentheses: e.g. `df.columns`.


In [None]:
# List the column names using the .columns *attribute*
df.columns
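
The `.columns` attribute can be turned into a plain list or used to check whether a column exists. This is a small sketch on a stand-in table; `small_df` and `column_names` are illustrative names.

```python
import pandas as pd

small_df = pd.DataFrame({
    "year": [1952, 1957],
    "lifeExp": [28.801, 30.332],
})

column_names = list(small_df.columns)  # convert the Index object to a plain list
print(column_names)
print("lifeExp" in small_df.columns)   # membership test
print("gdp" in small_df.columns)
```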

🔔 **Question**: Here's another popular attribute: `shape`. What do you think it does?

In [None]:
df.shape

## Slicing Rows

You can slice rows of a DataFrame like you would a string or a list. If we just want three rows: 

In [None]:
df[6:9]
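
As with lists, the slice starts at the first number and stops just before the second. The sketch below shows this on a hypothetical ten-row table and compares it with `.iloc`, which makes "select by position" explicit.

```python
import pandas as pd

# Hypothetical table with ten rows, labeled 0 through 9
numbers = pd.DataFrame({"value": range(10, 20)})

subset = numbers[6:9]  # start at position 6, stop before position 9
print(subset)
print(subset.equals(numbers.iloc[6:9]))  # same rows as the explicit .iloc form
```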

## Conditional Subsetting

What if we want to subset our dataset based on some condition? For example, what if we just wanted the rows where the country is equal to Egypt? Or only observations from a particular year?

We can use so-called **value comparison operators** for this. They include the following:
* `<` less than
* `>` greater than
* `<=` less than or equal to
* `>=` greater than or equal to
* `==` equal to
* `!=` not equal

For instance, to check for each row whether it contains a data point from Egypt:

In [None]:
df['country'] == 'Egypt'

💡 **Tip**: Fancy terminology alert: the above Series is called a **Boolean mask**. It's like a list of True/False labels that we can use to filter our Data Frame for a certain condition! We'll cover this further in Python Fundamentals II.

Here, we subset our Data Frame with the fancy Boolean mask we just created. 

In [None]:
# Getting only the data points from Egypt
df[df['country'] == 'Egypt']

Note that the output of this operation is a **new data frame**! We can assign it to a new variable so we can work with this subsetted data frame. Let's do it again:

In [None]:
# Creating a new data frame with data from 2002
year_2002_df = df[df['year'] == 2002]
year_2002_df.head()
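
The two steps, building the mask and then applying it, can also be written separately. The sketch below does this on a stand-in table holding the five Afghanistan rows shown earlier; `small_df`, `mask`, and `later_years` are illustrative names.

```python
import pandas as pd

# The five Afghanistan rows shown by df.head() earlier
small_df = pd.DataFrame({
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.02, 36.088],
})

mask = small_df["year"] > 1957  # step 1: one True/False per row
print(mask.tolist())
print(mask.sum())               # True counts as 1, so this is the number of matches

later_years = small_df[mask]    # step 2: keep only the rows where the mask is True
print(later_years)
```

Checking `mask.sum()` first is a quick way to see how many rows a condition will keep.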

## 🥊 Challenge 4: Subsetting Data Frames

Subset the data frame to include only rows (country-year observations) with a life expectancy lower than 50.

In [None]:
# YOUR CODE HERE


## Creating a new Column

To create a new column, use the `[]` brackets with the new column name on the left side of the assignment. On the right side, we can use an existing column and do some calculations on it:

In [None]:
df['lifeExp_rounded'] = df['lifeExp'].round()
df.head()
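
A new column can also combine two existing columns. The following sketch multiplies population by GDP per capita on a two-row stand-in table (values copied from the `df.head()` output earlier); the column name `gdp_total` is made up for this example.

```python
import pandas as pd

# Two rows copied from the df.head() output earlier
small_df = pd.DataFrame({
    "pop": [8425333.0, 9240934.0],
    "gdpPercap": [779.445314, 820.85303],
})

# Multiplication happens row by row; "gdp_total" is a made-up column name
small_df["gdp_total"] = small_df["pop"] * small_df["gdpPercap"]
print(small_df)
```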

<a id='group'></a>

# 🎬 Demo: Grouping and Plotting Data Frames

There is a lot more you can do in Pandas. Take your learning further with D-Lab's [**Python Intermediate**](https://github.com/dlab-berkeley/Python-Intermediate-Pilot) or [**Python Data Wrangling**](https://github.com/dlab-berkeley/Python-Data-Wrangling) workshops. 

[Register now](https://dlab.berkeley.edu/training/upcoming-workshops) to learn more skills in Pandas, such as `groupby()`, a powerful operation that allows you to split data into groups based on some criteria.

Here's a small demo of what you'll learn:

In [None]:
import matplotlib.pyplot as plt

# Create new column with life expectancy sorted into 5 bins
df['lifeExpBins'] = pd.cut(df['lifeExp'], 5)

# Grouping by continent, count how many rows fall into each life expectancy bin
df_grouped = df.groupby('continent')['lifeExpBins'].value_counts()

# Pivot the table and put it in a new DataFrame
ag = df_grouped.unstack()

# Plot barchart
ag.plot.bar();
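
The core idea of `groupby()` may be easier to follow on a tiny table. This is a sketch with made-up values, chosen so that the group means can be checked by hand; `toy` and `group_means` are illustrative names.

```python
import pandas as pd

# Made-up values, chosen so the group means are easy to check by hand
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Split the rows by "group", then take the mean of "value" within each group
group_means = toy.groupby("group")["value"].mean()
print(group_means)
```

The demo above follows the same pattern, with `continent` as the grouping column and `.value_counts()` in place of `.mean()`.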

# 🎉 Well done!

Today's project took us through basic data manipulation and analysis using Pandas. 

### 💡 Tip: More workshops!

D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.

- To learn more about data wrangling, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).
- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).

<div class="alert alert-success">

## ❗ Key points

* Import a library into Python using `import <libraryname>`.
* Data frames allow you to work with tabular data (think Excel in Python).
* A .csv file is just a text file that contains data separated by commas.
* Use the `pandas` library to work with data frames.
* Data frames are conventionally assigned to a variable named `df`.
* `DataFrame` columns can be indexed using square brackets - e.g. `df['last_name']` indexes a column called "last_name" in `df`.
* Use the `.describe()` method on a `DataFrame` to get basic summary statistics.
    
</div>