<a href="https://colab.research.google.com/github/NIP-Data-Computation/show-and-tell/blob/master/piercel_week2_notes1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Author**: Pierce Lopez <br>
**Date Created**: August 10, 2020 <br>
**Last Updated**: August 11, 2020 <br> 
**Description**: Contains my notes on the Data Analyst lesson: _Data Manipulation with pandas_.

# Data Manipulation with pandas
For this chapter, we will use functions from `pandas`, `NumPy`, and `matplotlib.pyplot`, so do not forget to import the necessary modules!

```
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
## Chapter 1: Transforming Data

### Section 1: Introducing DataFrames

Starting with a recap:

1. pandas is a high-level data analysis and visualization tool built on the NumPy and Matplotlib modules.
2. Ways to explore a DataFrame:
  * `dataframe.head()` - displays first few rows
  * `dataframe.info()` - displays DataFrame information (column names, column data types, missing values)
  * `dataframe.describe()` - displays summary statistics for numeric columns
  
  **DataFrame Attributes**
  * `dataframe.shape` - displays DataFrame dimensions (ordered pair)
  * `dataframe.values` - displays row values in a 2D NumPy array
  * `dataframe.columns` - displays column names
  * `dataframe.index` - displays row names/indices

**Note:** The last two attributes are Index objects, which will be discussed in another chapter!
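
To make the recap concrete, here is a minimal sketch that runs the exploration methods and attributes on a tiny DataFrame. The `dogs` table and its values are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "height_cm": [56, 43, 46, 49],
})

print(dogs.head(2))      # first two rows
dogs.info()              # column names, data types, non-null counts
print(dogs.describe())   # summary statistics for the numeric column

print(dogs.shape)        # (number of rows, number of columns)
print(dogs.values)       # row values as a 2D NumPy array
print(dogs.columns)      # Index object holding the column names
print(dogs.index)        # Index object holding the row labels
```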

<br>

### Section 2: Sorting and subsetting

1. Sorting
  `dataframe.sort_values(["col_name1", "col_name2"])` 
  * sorts rows by `col_name1` first, then uses `col_name2` to break ties, both in ascending order by default
  * setting `ascending = False` sorts in descending order
  * **Note:** pass `ascending = [list_of_boolean_values]` (one value per column) to set the sort direction of each column individually.

2. Subsetting columns (recap)
  * `dataframe["col_name"]` - for single columns
  * `dataframe[["col_name1", "col_name2"]]` - for multiple columns
    * **Recall:** Double square brackets are used to keep the DataFrame data structure!

3. Subsetting rows by comparison (recap)
  * `filter = dataframe["col_name"] > 50` - a logical filter
  * `dataframe[filter]` - passing the logical filter
    * **Note:** Multiple filters can be combined using the element-wise logical operators `&` (and), `|` (or), and `~` (not). Wrap each condition in parentheses.

```
# pass multiple filters at once
dataframe[(filter1) & (filter2)]
```

4. Subsetting rows using the `isin()` method.
  * `filter = dataframe["col_name"].isin(["val1", "val2"])` - a filter
  * `dataframe[filter]` - passing the filter
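
The sorting and subsetting patterns above can be sketched on a small example. The `dogs` table, its values, and the variable names are hypothetical and only serve to show what each call returns.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Schnauzer", "Poodle"],
    "height_cm": [56, 43, 59, 49, 60],
})

# sort by breed (A to Z), then by height within each breed (tallest first)
sorted_dogs = dogs.sort_values(["breed", "height_cm"], ascending=[True, False])

# combine two logical filters with & (each one is a Boolean Series)
is_tall = dogs["height_cm"] > 50
is_lab = dogs["breed"] == "Labrador"
tall_labs = dogs[is_tall & is_lab]

# keep rows whose breed appears in a list of allowed values
lab_or_poodle = dogs[dogs["breed"].isin(["Labrador", "Poodle"])]

print(sorted_dogs)
print(tall_labs)
print(lab_or_poodle)
```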

<br>

### Section 3: New columns

Adding a new column

  * `dataframe["kilometer"] = dataframe["meter"] / 1000`
  * `dataframe["sq_km"] = dataframe["kilometer"] ** 2`
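
As a quick check of the unit conversion, here is a small sketch assuming the `meter` column holds lengths in metres, so converting to kilometres means dividing by 1000. The `tracks` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical lengths in metres
tracks = pd.DataFrame({"meter": [500.0, 1500.0, 42195.0]})

# new columns are computed element-wise from existing ones
tracks["kilometer"] = tracks["meter"] / 1000
tracks["sq_km"] = tracks["kilometer"] ** 2

print(tracks)
```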

<br>

## Chapter 2: Aggregating Data

### Section 1: Summary statistics

1. Common summary statistics:

  * `.mean()` <br>
  * `.median()` <br>
  * `.mode()` <br>
  * `.min()` - minimum <br>
  * `.max()` - maximum <br>
  * `.var()` - variance <br>
  * `.std()` - standard deviation <br>
  * `.sum()` <br>
  * `.quantile()`

**Note:** these can work on multiple columns!

```
mean = dataframe[["col_name1", "col_name2"]].mean()
``` 
<br>

2. The `.agg()` method
  * Allows us to solve for customized statistical values. These customized statistical values can be made by defining functions.

```
# calculate square of mean
def sq_mean(column) :
  return column.mean() ** 2

dataframe["col_name1"].agg(sq_mean)
```

**Note:** `.agg()` can be used on multiple columns and it can also be used to get multiple customized statistical values!

```
# calculate square of mean
def sq_mean(column) :
  return column.mean() ** 2

# calculate cube of mean
def cb_mean(column) :
  return column.mean() ** 3
dataframe[["col_name1", "col_name2"]].agg([sq_mean, cb_mean])
```

3. Cumulative statistics

  * `.cumsum()` - cumulative sum <br>
  * `.cummax()` - cumulative maximum <br>
  * `.cummin()` - cumulative minimum <br>
  * `.cumprod()` - cumulative product
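
Here is a minimal sketch that puts a plain summary statistic, a custom `.agg()` function, and a cumulative statistic side by side. The `sales` table is made up for illustration; `sq_mean` is the same helper defined earlier in these notes.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
sales = pd.DataFrame({
    "units": [1, 2, 3, 4],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# calculate square of mean
def sq_mean(column):
    return column.mean() ** 2

# a built-in statistic on several columns (note the double brackets)
col_means = sales[["units", "price"]].mean()

# custom functions and built-in names (as strings) can be mixed in one list
stats = sales[["units", "price"]].agg([sq_mean, "max"])

# cumulative statistics return one value per row
sales["cum_units"] = sales["units"].cumsum()

print(col_means)
print(stats)
print(sales)
```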

<br>

### Section 2: Counting

1. Dropping duplicates (to avoid miscounts)
```
unique = dataframe.drop_duplicates(subset = "col_name")
```

**Note:** Two rows can share the same value in one column yet refer to two different things. In that case, we drop duplicates on a combination of columns that tells them apart!

```
unique_corrected = dataframe.drop_duplicates(subset = ["col_name1", "col_name2"])
```

2. Counting

```
# counting 
unique_corrected["col_name"].value_counts()

# counting and sorting (sort = True is the default: most frequent first)
unique_corrected["col_name"].value_counts(sort = True)

# counting, sorting, and converting counts to proportions
unique_corrected["col_name"].value_counts(sort = True, normalize = True)
```
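
The miscount that the note above warns about is easy to see on a small example. In this sketch (hypothetical `visits` data), two different dogs share the name Max, so dropping duplicates on `name` alone loses one of them.

```python
import pandas as pd

# Hypothetical vet visits: two different dogs are both called Max
visits = pd.DataFrame({
    "name": ["Max", "Max", "Max", "Bella", "Bella"],
    "breed": ["Labrador", "Labrador", "Chow Chow", "Poodle", "Poodle"],
})

# dropping on name alone merges the two different dogs called Max
by_name = visits.drop_duplicates(subset="name")

# name + breed keeps one row per distinct dog
unique_dogs = visits.drop_duplicates(subset=["name", "breed"])

counts = unique_dogs["name"].value_counts()                # how many dogs per name
shares = unique_dogs["name"].value_counts(normalize=True)  # same, as proportions

print(counts)
print(shares)
```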

<br>

### Section 3: Grouped summary statistics

We can take summary statistics by group using the `groupby()` method to avoid repetitive typing!

```
# get group statistics
dataframe.groupby("column_you_want_to_group")["data_column_you want_to_get_statistics_on"].mean()

# get multiple group statistics (built-in statistics are passed by name, as strings)
dataframe.groupby("column_you_want_to_group")["data_column_you want_to_get_statistics_on"].agg(["mean", "min", "max"])
```

**Note:** We can also group using multiple variables!
```
# get multi-group statistics
dataframe.groupby(["column_you_want_to_group1", "column_you_want_to_group2"])["data_column_you want_to_get_statistics_on"].mean()

# get multi-group multi-statistics
dataframe.groupby(["column_you_want_to_group1", "column_you_want_to_group2"])[["data_column_you want_to_get_statistics_on1", "data_column_you want_to_get_statistics_on2"]].mean()
```
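
A runnable version of the grouped statistics might look like the sketch below. The `dogs` table and its column names are hypothetical stand-ins for the long placeholder names used above.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
dogs = pd.DataFrame({
    "breed": ["Labrador", "Labrador", "Poodle", "Poodle"],
    "color": ["Black", "Brown", "Black", "Black"],
    "weight_kg": [30.0, 26.0, 24.0, 20.0],
})

# one statistic per group
mean_by_breed = dogs.groupby("breed")["weight_kg"].mean()

# several statistics per group, named as strings
stats_by_breed = dogs.groupby("breed")["weight_kg"].agg(["mean", "min", "max"])

# grouping by two variables gives a result with a two-level index
mean_by_breed_color = dogs.groupby(["breed", "color"])["weight_kg"].mean()

print(mean_by_breed)
print(stats_by_breed)
print(mean_by_breed_color)
```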

<br>

### Section 4: Pivot tables

Pivot tables offer another way to compute grouped summary statistics, similar to the `groupby()` method.

```
# the two lines (groupby vs pivot table) are similar
dataframe.groupby("column_you_want_to_group")["data_column_you want_to_get_statistics_on"].mean()

dataframe.pivot_table(values = "data_column_you want_to_get_statistics_on", index = "column_you_want_to_group")
```

**Note:** We can change the output statistics we want using the `aggfunc` argument.

```
# change output statistics
dataframe.pivot_table(values = "data_column_you want_to_get_statistics_on", index = "column_you_want_to_group", aggfunc = np.median)

# multiple output statistics
dataframe.pivot_table(values = "data_column_you want_to_get_statistics_on", index = "column_you_want_to_group", aggfunc = [np.median, np.mean])
```

We can also use `pivot_table` on two variables:

```
# the two lines (groupby vs pivot table) are similar
dataframe.groupby(["column_you_want_to_group1", "column_you_want_to_group2"])["data_column_you want_to_get_statistics_on"].mean()

dataframe.pivot_table(values = "data_column_you want_to_get_statistics_on", index = "column_you_want_to_group1", columns = "column_you_want_to_group2")
```

**Note:** The two lines above compute the same values but present them differently. The pivot table spreads the second grouping variable across the columns, so group combinations that never occur in the data show up as missing (`NaN`) values. We can replace those missing values (with 0, for example) using the `fill_value` argument.

```
# replace NA values with 0
dataframe.pivot_table(values = "data_column_you want_to_get_statistics_on", index = "column_you_want_to_group1", columns = "column_you_want_to_group2", fill_value = 0)
```

We can also add the `margins = True` argument to append a row and a column (named `All` by default) that hold the mean for each column and row **NOT INCLUDING** missing values.

```
# get mean for each row and column of the pivot table
dataframe.pivot_table(values = "data_column_you want_to_get_statistics_on", index = "column_you_want_to_group1", columns = "column_you_want_to_group2", fill_value = 0, margins = True)
```
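
To see how `fill_value` and `margins` change the output, here is a small sketch on hypothetical data in which one breed-and-color combination (a brown Poodle) never occurs.

```python
import pandas as pd

# Hypothetical data: there is no brown Poodle
dogs = pd.DataFrame({
    "breed": ["Labrador", "Labrador", "Poodle", "Poodle"],
    "color": ["Black", "Brown", "Black", "Black"],
    "weight_kg": [30.0, 26.0, 24.0, 20.0],
})

# default aggregation is the mean; the missing combination shows up as NaN
plain = dogs.pivot_table(values="weight_kg", index="breed", columns="color")

# fill the gap with 0 and add the "All" row and column
with_margins = dogs.pivot_table(
    values="weight_kg",
    index="breed",
    columns="color",
    fill_value=0,
    margins=True,
)

print(plain)
print(with_margins)
```

In this sketch the `All` entry for Poodle is the mean of the two Poodle weights, so the 0 used as a filler does not pull it down.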

<br>


## Chapter 3: Slicing and Indexing

### Section 1: Explicit indexes

1. Setting a column as indices (row labels) <br>
`dataframe.set_index("col_name")`

2. Resetting an index (the index values move back into a regular column) <br>
`dataframe.reset_index()`

3. Dropping an index <br>
`dataframe.reset_index(drop = True)`

  * **Note:** This discards the index values entirely instead of moving them back into a column, so be careful!

4. Using `.loc` to subset indexes <br>
`dataframe.loc["index_name"]`

5. Multi-level indexing <br>
`dataframe.set_index(["col_name1", "col_name2"])`
  * Subsetting the outer level of a multi-level index (by a list):
    * `dataframe.loc[["outer_index_name1", "outer_index_name2"]]`
  * Subsetting the inner level of a multi-level index (by ordered pairs in list)
    * `dataframe.loc[[("outer_index_name1", "inner_index_name1"), ("outer_index_name2", "inner_index_name2")]]`

6. Sorting index values <br>
  * Single-level index:
    * `dataframe.sort_index()`
  * Multi-level index: levels and manner of ordering can be changed
    * `dataframe.sort_index(level = ["col_name1", "col_name2"], ascending = [True, False])`
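
The index operations in this section can be tried out with the sketch below. The `dogs` table and all variable names are hypothetical; the multi-level index is sorted right after it is created, which is an extra step added here so that the `.loc` lookups behave predictably.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
dogs = pd.DataFrame({
    "breed": ["Poodle", "Labrador", "Labrador", "Poodle"],
    "color": ["Black", "Brown", "Black", "White"],
    "weight_kg": [24.0, 26.0, 30.0, 20.0],
})

by_breed = dogs.set_index("breed")
labs = by_breed.loc["Labrador"]            # every row labelled "Labrador"
restored = by_breed.reset_index()          # breed becomes a regular column again
dropped = by_breed.reset_index(drop=True)  # breed values are discarded

# multi-level index, sorted before subsetting
multi = dogs.set_index(["breed", "color"]).sort_index()
outer = multi.loc[["Labrador"]]                                  # list of outer labels
inner = multi.loc[[("Labrador", "Black"), ("Poodle", "White")]]  # list of (outer, inner) pairs

# sort breed A to Z, and color Z to A within each breed
sorted_multi = multi.sort_index(level=["breed", "color"], ascending=[True, False])

print(labs)
print(inner)
print(sorted_multi)
```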

<br>

### Section 2: Slicing and subsetting with `.loc` and `.iloc`

This section shows a recap of Chapter 2 (Section 4) of the previous lesson (Intermediate Python). 

<br>

1. Sort the index before you slice! <br>
`dataframe.sort_index(level = ["col_name1", "col_name2"], ascending = [True, False])`

2. Slicing correctly!
  * Outer-level index:
    * `dataframe.loc["outer_level_index1":"outer_level_index2"]`
  * Multi-level index:
    * `dataframe.loc[("outer_level_index1","inner_level_index1"):("outer_level_index2","inner_level_index2")]`
    * **Note:** We don't need extra square brackets here because we are slicing with `:` instead of passing a list of labels!
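
Here is a minimal sketch of slicing a sorted multi-level index, with a position-based `.iloc` slice for comparison. The data and names are hypothetical. Note that `.loc` slices include both endpoints, while `.iloc` slices exclude the end position.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
dogs = pd.DataFrame({
    "breed": ["Poodle", "Labrador", "Labrador", "Poodle", "Beagle"],
    "color": ["Black", "Brown", "Black", "White", "Brown"],
    "weight_kg": [24.0, 26.0, 30.0, 20.0, 10.0],
})

# sort the index before slicing
multi = dogs.set_index(["breed", "color"]).sort_index()

# outer-level slice: every Beagle and Labrador row
outer_slice = multi.loc["Beagle":"Labrador"]

# multi-level slice: from (Labrador, Brown) through (Poodle, Black)
inner_slice = multi.loc[("Labrador", "Brown"):("Poodle", "Black")]

# position-based slice: rows 0 and 1 only
first_two = multi.iloc[0:2]

print(outer_slice)
print(inner_slice)
print(first_two)
```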
    
<br>

### Section 3: Working with pivot tables

This section shows a recap of Chapter 2 (Section 4) of the current lesson (Data Manipulation with pandas). 

<br>

**Additional insights:** 

1. We can get summary statistics across rows or columns of a pivot table using the `axis` argument

```
# mean across index values
pivottable.mean(axis = "index")

# mean across column values
pivottable.mean(axis = "columns")
```

2. For a column of dates, we can access the date components using the `.dt.year`, `.dt.month`, and `.dt.day` Series attributes!
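
Both insights can be combined in one sketch: pull the year and month out of a date column, pivot on them, then average along each axis. The `temps` table and its readings are made up for illustration.

```python
import pandas as pd

# Hypothetical temperature readings
temps = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-15", "2019-07-15", "2020-01-15", "2020-07-15"]),
    "temp_c": [26.0, 28.0, 27.0, 29.0],
})

# date components come from the .dt accessor of a datetime column
temps["year"] = temps["date"].dt.year
temps["month"] = temps["date"].dt.month

# rows are years, columns are months
by_year_month = temps.pivot_table(values="temp_c", index="year", columns="month")

mean_per_month = by_year_month.mean(axis="index")    # one value per column
mean_per_year = by_year_month.mean(axis="columns")   # one value per row

print(by_year_month)
print(mean_per_month)
print(mean_per_year)
```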

<br>


## Chapter 4: Creating and Visualizing DataFrames

### Section 1: Visualizing your data

This section shows a recap of Chapter 4 of a previous lesson (Introduction to Data Science in Python).

**Additional insights:** 

1. Instead of using `plt.hist(dataframe["col_name"])`, we can call `dataframe["col_name"].hist()` directly, which is a bit shorter.
2. We can also use the `kind` argument of the DataFrame's own `.plot()` method (for example `kind = "bar"`, `"line"`, or `"scatter"`) to change the type of plot we want to use.
3. We can rotate the axis tick labels using the `rot` argument of `.plot()`.
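
A short sketch of these plotting shortcuts is given below, using hypothetical `dogs` data. The `matplotlib.use("Agg")` line is an assumption made so the sketch also runs without a display; in a notebook it can be left out so that the figures show inline.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, made up for illustration only
dogs = pd.DataFrame({
    "breed": ["Labrador", "Labrador", "Poodle", "Poodle", "Beagle"],
    "height_cm": [56, 59, 43, 60, 38],
})

# histogram straight from the column
ax_hist = dogs["height_cm"].hist(bins=5)
plt.close("all")

# bar plot of the mean height per breed, with rotated tick labels
mean_height = dogs.groupby("breed")["height_cm"].mean()
ax_bar = mean_height.plot(kind="bar", rot=45, title="Mean height by breed")
plt.close("all")
```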

<br>

### Section 2: Missing values

Missing values are indicated with `NaN`, short for _not a number_.
* `dataframe.isna()` returns a Boolean for every element: `True` where the value is missing.
* `dataframe.isna().any()` returns one Boolean per column: `True` if the column contains at least one missing value.
* `dataframe.isna().sum()` counts the number of missing values per column.
  * We can also plot this to visualize the frequency of missing values per column.
* `dataframe.dropna()` removes rows that contain missing values.
* `dataframe.fillna(0)` replaces missing values with, in this case, 0.
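
The methods above can be compared on a tiny DataFrame with a few gaps. This is a minimal sketch; the `df` table and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in columns a and b
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

has_missing = df.isna().any()   # one Boolean per column
n_missing = df.isna().sum()     # number of missing values per column
complete_rows = df.dropna()     # keeps only rows without any missing value
filled = df.fillna(0)           # replaces every missing value with 0

print(has_missing)
print(n_missing)
print(complete_rows)
print(filled)
```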

<br>

### Section 3: Creating DataFrames

This section shows a recap of Chapter 2 (Sections 1-3) of the previous lesson (Intermediate Python).

**Additional insights:**

1. A list of dictionaries builds a DataFrame row by row.
```
# a list of dictionaries (keys become the column names)
list_of_dicts = [{key-value pairs that fill the first row},{key-value pairs that fill the second row}]
```
2. A dictionary of lists builds a DataFrame column by column.
```
# a dictionary of lists
dict_of_lists = {col_name1:[list of column1 values],col_name2:[list of column2 values]}
```
3. Create the DataFrame by using either the list or dictionary.
```
# convert to DataFrame
pd.DataFrame(list_of_dicts)
pd.DataFrame(dict_of_lists)
```
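
With concrete values filled in, the two approaches look like this. The names and heights are made up for illustration; both routes produce the same DataFrame.

```python
import pandas as pd

# a list of dictionaries: one dictionary per row
list_of_dicts = [
    {"name": "Ginger", "height_cm": 22},
    {"name": "Scout", "height_cm": 59},
]

# a dictionary of lists: one list per column
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "height_cm": [22, 59],
}

from_rows = pd.DataFrame(list_of_dicts)
from_cols = pd.DataFrame(dict_of_lists)

print(from_rows)
print(from_rows.equals(from_cols))
```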

<br>

### Section 4: Reading and writing CSVs

1. Reading

`pd.read_csv(filename)`

2. Writing

`dataframe.to_csv(filename)`
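
A round trip through a CSV file can be sketched as follows. The file name `dogs.csv` and the data are hypothetical, and a temporary directory is used so that nothing is left behind. Passing `index = False` is a choice made here to avoid writing the row labels as an extra column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, made up for illustration only
df = pd.DataFrame({"name": ["Ginger", "Scout"], "height_cm": [22, 59]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dogs.csv")
    df.to_csv(path, index=False)   # write without the row labels
    loaded = pd.read_csv(path)     # read the file back into a DataFrame

print(loaded)
```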



Tasks from this lesson (self-assessment)

1. Transforming Data
  * Introducing DataFrames
    * Did I use different DataFrame attributes?
  * Sorting and Subsetting
    * Did I sort values of data within columns?
  * New Columns
    * Did I append new columns to a DataFrame?

2. Aggregating Data
  * Summary Statistics
    * Did I obtain some summary statistics from the Customs dataset?
    * Did I use the `agg()` function to obtain multiple statistics and/or customized statistics?
  * Counting
    * Did I drop duplicates, if there are any?
    * Did I use the `value_counts()` function?
  * Grouped Summary Statistics
    * Did I use the `groupby()` function to collect similar groups?
  * Pivot Tables
    * Did I create pivot tables?

3. Slicing and Indexing
  * Explicit Indexes
    * Did I set a column as an index?
    * Did I create a multi-level indexed DataFrame?
  * Slicing and Subsetting with `loc[]` and `iloc[]`
    * Did I subset/slice a DataFrame using any of these methods?
  * Working with Pivot Tables
    * Did I use the `axis` argument to get a statistic across a row/column?

4. Creating and Visualizing DataFrames
  * Visualizing Your Data
    * Did I use line plots?
    * Did I use scatter plots?
    * Did I use bar plots?
    * Did I use histograms?
    * Did I use some arguments to give further details regarding my plots?
  * Missing Values
    * Did I check if data had any missing values?
    * If so, did I try to replace them with more appropriate values?
  * Creating DataFrames
    * Did I create a DataFrame using a list and/or dictionary?
  * Reading and Writing CSVs
    * Did I read a csv file when doing the applications?
    * Did I write to a csv file when doing the applications?