# Data Analysis with `pandas`

`pandas` is one of the most widely used Python packages for data analysis, with a focus on data manipulation and processing. We have alluded to `pandas` when talking about DataFrames and libraries. Now we will dive into a few key concepts in the `pandas` package.

To learn more, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).

In [None]:
# recall that pandas is frequently imported with the alias pd
import pandas as pd
import numpy as np

For now, let's use an existing dataset, the [penguins dataset](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data?resource=download&select=penguins_size.csv)! The dataset consists of body measurements for three penguin species (Adelie, Chinstrap, Gentoo). We will load the file and use `df.head()` to look at the first few rows.

The data has the following columns: 

- Species (Adelie, Gentoo, Chinstrap)
- Island
- Culmen Length (mm)
- Culmen Depth (mm)
- Flipper Length (mm)
- Body Mass (g)
- Sex (Male / Female)

If you were wondering, the culmen is the top part of the penguin's bill!

**Question:** How many rows/columns are in the data set?

In [None]:
penguins = pd.read_csv('penguins.csv')
penguins.head()
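One way to check the size of a DataFrame is its `.shape` attribute. The sketch below uses a small, made-up DataFrame (the name `toy` and its values are assumptions for illustration, not the real penguins file), so it runs without any data file.

```python
import pandas as pd

# Hypothetical toy data with made-up values (not the real penguins file)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700.0, 5000.0, 3800.0],
})

# .shape is an attribute (no parentheses) holding (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

The same attribute can be read from `penguins` once the file has been loaded.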

## DataFrame Methods

Just like other objects, `DataFrames` have a set of methods associated with them. There are many methods for summarizing `pd.DataFrames`. For example, [`df.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) will give some summary statistics for the columns of a DataFrame. Let's look at how `.describe()` works on the `penguins` DataFrame.


**Question:** Why are only some of the columns in the DataFrame visible in the output below?

In [None]:
penguins.describe()

This function is good for summarizing numerical data in a dataset. However, sometimes this might not be enough. For example, what if we wanted the median of the penguin mass rather than the mean? 

First, let's select just one column to operate on. We can select an individual column with bracket notation. This is analogous to indexing a list.

**Question:** What is the type of the output?

In [None]:
penguins['body_mass_g']

A single column of a `pandas` DataFrame is a `Series` object. It can be treated like a list or other iterable, and it lets you do calculations over its values.

We can then look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) to see the methods and attributes that are available for `Series` objects. If we want the median, we can use the `.median()` method.

In [None]:
penguins['body_mass_g'].median()
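As a minimal sketch of the same idea on data small enough to check by hand, the example below selects a column from a made-up DataFrame (`toy` and its values are assumptions, not taken from the penguins file) and compares the median with the mean.

```python
import pandas as pd

# Hypothetical toy data with made-up values
toy = pd.DataFrame({"body_mass_g": [3000.0, 3500.0, 4000.0, 6000.0]})

# Bracket notation with a single column name returns a Series
column = toy["body_mass_g"]
print(type(column))

# The median is the middle of the sorted values; the mean is pulled up by the largest value
print(column.median())
print(column.mean())
```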

We can also do operations on a column. 

**Question:** What will happen in the code below? What is the type/shape of the output?

In [None]:
penguins['body_mass_g']/1000

This is called a **vectorized operation**: the operation is applied to each element of the column at once. This allows you to efficiently apply operations to every item of the Series. However, knowing when something will be vectorized and when it won't can sometimes be a challenge.
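To make the element-by-element behavior concrete, here is a small sketch that compares a vectorized division with an explicit loop over the same values. The Series `mass_g` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy values in grams
mass_g = pd.Series([3000.0, 4500.0, 6000.0])

# Vectorized: the division is applied to every element of the Series
mass_kg = mass_g / 1000

# The same result written as an explicit loop over the values
mass_kg_loop = pd.Series([value / 1000 for value in mass_g])

# Both approaches produce a Series with the same values and index
print(mass_kg.equals(mass_kg_loop))
```

The loop version gives the same values, but the vectorized form is shorter and is usually the more efficient choice on large columns.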

A variation on this is an operation containing two columns. Let's say we want to take the ratio of the culmen length and depth for all of the penguins.

**Question:** The code below has two errors in it. What is it trying to do, and how do you fix it?

In [None]:
penguins['culmen_ratio'] = penguins['culmen_length_mm']/penguins['culmen_depth_mm']
penguins['culmen_ratio']

In [None]:
penguins.describe()

## Challenge 1: Methods

For each of the following methods, answer the following questions:
1. Is the method operating on a `DataFrame` or a `Series` object?
2. Look up the documentation. What type is the output?
3. Run the method. Note any discrepancies from your prediction.

**Bonus:** If you run the method on the opposite type, what happens? (It runs with the same output, it runs with different output, or it raises an error.)

In [None]:
penguins['species'].value_counts(ascending=True)

In [None]:
penguins.isnull()

In [None]:
penguins.dropna()

In [None]:
penguins['species'].str[0]

There are easily several hundred methods associated with `DataFrames` and `Series`. For this reason, it is impractical to try to memorize every method and its arguments. Rather, it is often more productive to focus on developing (1) an understanding of what is possible with Python and (2) the ability to learn how to implement new functions by reading documentation, examples, etc.

## Challenge 2: Categorical --> Numeric data

1) Recall that in the penguins data set, there was one column that had two values 'MALE' and 'FEMALE'. Let's say that for a model, we want to replace the string values with numbers (FEMALE = 0; MALE = 1) that will serve as input to the model. Look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and identify a method to *replace* the strings with their corresponding numbers. Then try to implement the method. What roadblocks do you run across?

In [None]:
#YOUR CODE HERE
penguins['sex_numeric'] = penguins['sex'].replace({'FEMALE':0,'MALE':1})
penguins.head()

2)  Notice that there are some 'NaN' values in the `Series`. You do some research and identify three possible solutions to deal with the NaN values (listed below). For each of the options, describe what will happen to NaN values in the column, and the DataFrame as a whole. Which option seems most appropriate?

Consider the following:
- Is the whole DataFrame or just the column (Series) being operated on?
- What exactly is happening to the NaN values?
- What are the consequences, if any, for the solution in the hypothetical model? 

**Hint:** The documentation is your friend!

In [None]:
penguins['sex'].replace(['MALE','FEMALE',np.nan],[1,0,2])

In [None]:
penguins.dropna(subset=['sex'])  # Drops all rows where 'sex' is NaN

In [None]:
penguins.fillna(2)

**Question:** What was the most helpful tool/strategy for figuring out which method to use?
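When weighing options like the ones above, it can help to first count how many values are missing. The sketch below does this on a made-up DataFrame (`toy` and its values are assumptions for illustration) by chaining `.isnull()` with `.sum()`, and by passing `dropna=False` to `.value_counts()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "sex": ["MALE", np.nan, "FEMALE", "MALE"],
    "body_mass_g": [3700.0, 3800.0, np.nan, 4100.0],
})

# .isnull() marks missing entries as True; summing counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna=False keeps NaN as its own category in the counts
sex_counts = toy["sex"].value_counts(dropna=False)
print(sex_counts)
```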

## Selecting Columns and Rows

We can use `.loc[row, column]` to index columns and rows in the DataFrame. This is a complex topic, so we will cover just the most common case here. Most commonly, `.loc[]` is used to subset to a selection of rows in the DataFrame. 

In this case, we use a **Boolean mask** to represent which rows to select. A Boolean mask is a Series of `True`/`False` values produced by applying a condition to a Series: it is `True` where the condition is met and `False` elsewhere. Sound familiar? The `.isnull()` method returns a Boolean mask!

We can use Boolean masks with `.loc[]` to subset our DataFrames! For example, let's say that we want to measure some variables for penguins found on Torgersen island. We select the `island` column and use `==` to check whether each value in that column is exactly 'Torgersen'. **Note:** As written, this requires an *exact* match.


In [None]:
penguins['island']=='Torgersen'

Then to get the subset of the entire `penguins` object, we can pass this Boolean mask to `.loc[]`:

In [None]:
penguins.loc[penguins['island']=='Torgersen']

Now, if you wish to subset this DataFrame for columns as well as rows, you can include a columns argument in `.loc[]` that includes a list of columns to subset.

In [None]:
# Select the island, species, and sex columns for rows where the island is Torgersen
penguins.loc[penguins['island']=='Torgersen',['island','species','sex']]

penguins.loc?
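The mask-then-`.loc[]` pattern can be traced step by step on a made-up DataFrame small enough to read at a glance. The name `toy` and all of its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with made-up values
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Torgersen", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "body_mass_g": [3700.0, 5000.0, 3400.0, 3800.0],
})

# Step 1: the comparison returns a Series of True/False values (the Boolean mask)
mask = toy["island"] == "Torgersen"
print(mask)

# Step 2: .loc[rows, columns] keeps rows where the mask is True and only the listed columns
subset = toy.loc[mask, ["species", "body_mass_g"]]
print(subset)
```

Note that the subset keeps the original row labels (here 0 and 2) rather than renumbering them.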

## Challenge 3: Subsetting a DataFrame

1. Modify the `.loc[]` expression above to subset for all Adelie penguins and save it to a variable `adelie`.
2. Calculate the mean body mass for this species (**Hint**: use `.mean()`).
3. Repeat 1-2 for Gentoo and Chinstrap penguins.

In [None]:
##your code here
adelie = 

#gentoo = 

#chinstrap = 

## Plotting with `pandas`

We often want to look at our data visually. Fortunately, `pandas` also offers some basic plotting functions that can be useful in exploring a data set. In this section, we will cover two basic types of plots: histograms and scatter plots. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for further information on plotting and plot customization.

### Histograms

A histogram shows the distribution of a variable using binned values. We can call this using the syntax: `df[column].plot(kind='hist')`.

The `bins` keyword argument changes the number of bins in the histogram. A few examples of the bins argument are below.

**Question:** Which plot would you pick? Why? What do you notice about the distribution of the data?

In [None]:
print('Plot A: 5 Bins')
fig = penguins['body_mass_g'].plot(kind='hist', title='Histogram of body mass values', bins=5)

In [None]:
print('Plot B: 10 Bins')
fig = penguins['body_mass_g'].plot(kind='hist', title='Histogram of body mass values', bins=10)

In [None]:
print('Plot C: 20 bins')
penguins['body_mass_g'].plot(kind='hist',
                             title='Histogram of body mass values', bins=20)
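To see numerically how the number of bins changes what a histogram shows, the sketch below counts values per bin with `np.histogram`. This is a stand-in for inspecting the binning, not the plotting call itself, and the Series `values` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with one large outlier
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 9.0])

# Count how many values fall in each bin for two different bin settings
counts_by_bins = {}
for n_bins in (2, 4):
    counts, edges = np.histogram(values, bins=n_bins)
    counts_by_bins[n_bins] = counts
    print(n_bins, "bins -> counts:", counts, "edges:", edges)
```

With fewer bins, most values collapse into a single bar. With more bins, the gap between the bulk of the data and the outlier becomes visible.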

### Scatter Plots

Scatter plots visualize bivariate relationships. We can create a scatter plot by specifying the columns to use for the `x` and `y` axes. Notice that instead of calling it on a single column of data, we are using `df.plot(kind='scatter')`.

**Question:** Do you notice any pattern in the data? What might be causing that pattern?

In [None]:
fig = penguins.plot(kind='scatter',
              x='culmen_length_mm',
              y='culmen_depth_mm',
              title='Relationship between culmen length and depth')

## Challenge 4: Customizing a Plot

One intuition may be that different penguin species have different culmen lengths and depths, resulting in the pattern observed in the scatterplot above. Let's say we want to explore this pattern by plotting the data for each species in a different color. This will allow us to see the pattern if it is present.

We implement this by plotting an individual layer for each species. Many plotting libraries build a plot out of "layers" behind the scenes. This makes customization fairly easy, because each customization can be added as a new "layer".

So let's try it! Specifically, we want to visualize the culmen depth vs. the culmen length for each of the penguin species separately. We'll use different colors for each species.

To do this, we set the first layer equal to the variable `fig`. This represents our plot. All of our plots thus far have had a single layer. To include multiple layers in a plot, we simply include the argument `ax=fig` in any subsequent layers. This tells `pandas` to put new layers on the original plot rather than to make a new plot.
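Here is a minimal sketch of the layering mechanism on made-up data. The DataFrames `group_a` and `group_b`, their values, and the variable name `layered` are assumptions for illustration. The layers are told apart with the `marker` and `label` keywords here, so finding the color parameter is still left to you. Note that the object returned by `.plot()` is a matplotlib `Axes`, which is what `ax=` expects.

```python
import pandas as pd

# Hypothetical toy data: two made-up groups of points
group_a = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 3.0, 4.0]})
group_b = pd.DataFrame({"x": [1.5, 2.5, 3.5], "y": [5.0, 4.0, 3.0]})

# First layer: creates the plot and returns the object we call fig
fig = group_a.plot(kind="scatter", x="x", y="y", marker="o", label="group A")

# Second layer: ax=fig draws onto the existing plot instead of making a new one
layered = group_b.plot(kind="scatter", x="x", y="y", marker="x",
                       label="group B", ax=fig)
```

Because of `ax=fig`, both sets of points end up on the same axes, and the second call returns that same object.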

Follow the steps below to make your own layered visualization!

1. Make three different sub-DataFrames, one for each species, using `.loc[]` and a Boolean mask. (**Hint:** This is the solution to Challenge 3)
2. Plot the first layer and set it equal to `fig`.
3. Plot subsequent layers. Use a different color for each species (look at the documentation for the name of the color parameter). Some possible colors to use are `'green'`, `'red'`, `'purple'`, `'black'`, etc. (Remember to include the argument `ax=fig`!)
4. Do you notice a pattern in culmen measurements based on species? What other elements for the plot would be helpful for interpreting it?

**Bonus:** Add a title and any other modifications to the plot (better x and y labels, for example).

In [None]:
# YOUR CODE HERE

# Subset Data 
chinstrap = 
adelie = 
gentoo = 

# Create plot
fig = # First layer
# Plot other layers


For more on data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).