# Useful Functions on DataFrames
Here are some of the most useful functions on DataFrames. 

**df.apply()** - Apply a function along *an* axis of the DataFrame (either rows or columns). See the "08_Dataframe Apply" section for an example. <br>
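**df.columns** - Returns the column labels as a pandas Index (wrap it in `list()` for a plain list). Note that it is an attribute, so it is written without parentheses. Column names can be used as keys to extract columns from a dataframe. <br>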
**df.head(#)** and **df.tail(#)** - Returns the first (head) or last (tail) # rows of a dataframe. <br>
**df.describe()** - Returns quick statistics by column, including count, mean, standard deviation, min, max, and the 25th/50th/75th percentiles (the 50th percentile is the median). <br>
**df.sort_values()** - Sort rows by the values in one or more columns, similar to using Sort in Excel. Note that each row keeps its original index label. <br>
**df.sort_index()** - Sort by index. Useful for restoring the original row order after sorting by a certain column. <br>
**df.groupby()** - Collapse rows that share the same value in one or more columns. Useful for going from a fine to a coarse analysis. For example, if data is saved by country and each row also records a region or continent, we could group by that column and analyze by region or continent. <br>
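As a quick illustration of the simpler entries in the list above, here is a minimal sketch on a small made-up DataFrame. The table contents and the names `toy_df`, `column_names`, `first_two`, and `last_row` are hypothetical and are not part of the dataset loaded below.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "country": ["Kenya", "Peru", "Chile", "Ghana"],
    "continent": ["Africa", "South America", "South America", "Africa"],
    "population_millions": [54.0, 33.7, 19.5, 32.8],
})

column_names = list(toy_df.columns)          # attribute access, no parentheses
first_two = toy_df.head(2)                   # first two rows
last_row = toy_df.tail(1)                    # last row
populations = toy_df["population_millions"]  # a column name works as a key

print(column_names)
print(first_two)
print(last_row)
```

Calling `toy_df.columns()` with parentheses would raise a `TypeError`, because an Index object is not callable.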

## Initial dataframe to explore functions in

In [21]:
import numpy as np
import pandas as pd

# For this example, I downloaded May 2022 data
# This will take 1-2 minutes as there are 400,000+ rows of data!
df = pd.read_excel("../Datasets/all_data_M_2022.xlsx")

In [None]:
df

## df.describe()

Note that pandas will run statistics on every numeric column, even if the values don't make sense as quantities (numeric codes, for example). We would also expect a column like 'H_MEDIAN' to have statistics, but a column that contains any non-numeric values is stored as text (object dtype) and is left out of the describe output. These columns need to be modified first, as we did in the Sanitizing Data Example.
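The behavior described above can be seen in a small sketch. The DataFrame and its column names (`wage_numeric`, `wage_as_text`) are made up for illustration; the `"*"` entry stands in for the kind of placeholder symbol that forces a column to be stored as text.

```python
import pandas as pd

# Hypothetical toy data: one clean numeric column, one column polluted by "*"
toy_df = pd.DataFrame({
    "wage_numeric": [15.0, 22.5, 31.0],
    "wage_as_text": ["15.00", "*", "31.00"],  # stored as object dtype
})

# By default, describe() only summarizes the numeric columns
numeric_summary = toy_df.describe()
print(numeric_summary.columns.tolist())

# include="all" also reports text columns, with count/unique/top/freq instead
full_summary = toy_df.describe(include="all")
print(full_summary)
```

The `50%` row of the default output is the median of each numeric column.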

In [None]:
df.describe() # Quick statistics by column

### Sanitizing Columns for expanded df.describe()

In [8]:
columns_to_modify = ['H_MEAN', 'A_MEAN', 'MEAN_PRSE', 'H_PCT10', 
                     'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 
                     'A_PCT10', 'A_PCT25', 'A_MEDIAN', 'A_PCT75', 
                     'A_PCT90'
                    ]

# Splits columns by number type wanted
int_columns = ['A_MEAN', 'A_PCT10', 'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90']
float_columns = ['H_MEAN', 'MEAN_PRSE', 'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90']

# Initialize dict of column and its type
dict_column_and_type = {}

# Attach column name with the data type
for column_name in columns_to_modify:
    if column_name in int_columns:
        # 'Int64' is needed when mixing ints with pandas.NA
        dict_column_and_type[column_name] = 'Int64'
    else:
        # 'Float64' is needed when mixing floats with pandas.NA
        dict_column_and_type[column_name] = 'Float64'

# df.apply passes each column (a pandas Series) to the function, not a single cell value
def remove_commas(column):
    # regex=True removes commas inside string cells; non-string cells are left unchanged
    return column.replace(",", "", regex=True)

def replace_symbols(column):
    # pd.NA is pandas' marker for a missing value
    new_column = column.replace("*", pd.NA)
    new_column = new_column.replace("#", pd.NA) # Notice we are modifying our new_column variable
    new_column = new_column.replace(np.nan, pd.NA) # Notice we are AGAIN modifying our new_column variable
    # While RegEx could have done this all in a single step, it was easy to make a few .replace() statements to achieve the same purpose
    return new_column

# Initialize our sanitized dataframe
sanitized_df = df.copy()

# Apply our two functions
sanitized_df[columns_to_modify] = sanitized_df[columns_to_modify].apply(remove_commas)
sanitized_df[columns_to_modify] = sanitized_df[columns_to_modify].apply(replace_symbols)

# Using the dictionary, we cast each column as its associated data type
sanitized_df = sanitized_df.astype(dict_column_and_type)
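
To see the cleaning steps end to end without loading the full spreadsheet, here is a minimal sketch on a three-row toy table. The values are made up, and the sketch assumes the columns arrive as text (as they would from a CSV). It also swaps in `pd.to_numeric` to convert the cleaned strings to numbers before the cast, which is an alternative to the cell above rather than a copy of it.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: commas in numbers, "*" and "#" as placeholders
raw = pd.DataFrame({
    "A_MEAN": ["45,000", "*", "61,250"],
    "H_MEAN": ["21.63", "#", np.nan],
})

# regex=True replaces the comma inside each string cell
cleaned = raw.replace(",", "", regex=True)

# Without regex, replace() only matches whole cell values
cleaned = cleaned.replace(["*", "#"], np.nan)

# Convert the remaining strings to numbers (missing values stay missing)
for name in cleaned.columns:
    cleaned[name] = pd.to_numeric(cleaned[name])

# Cast to the nullable pandas dtypes, which can hold missing values
cleaned = cleaned.astype({"A_MEAN": "Int64", "H_MEAN": "Float64"})

print(cleaned.dtypes)
print(cleaned.describe())
```

Both columns now show up in the `describe()` output, since neither is stored as text any more.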

## Revised df for df.describe()

In [None]:
sanitized_df.describe() # Once we clean those columns, they appear in the .describe() output!

## Counting and Sorting values

In [None]:
area_titles = sanitized_df['AREA_TITLE'].value_counts() # Series of all unique entries in a column and their counts
area_titles
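
For a sense of what `value_counts()` returns, here is a small sketch on a made-up column; the area names are hypothetical.

```python
import pandas as pd

# Hypothetical column of area names with repeats
areas = pd.Series(["Texas", "Ohio", "Texas", "Texas", "Ohio", "Utah"], name="AREA_TITLE")

counts = areas.value_counts()  # Series: unique values as the index, counts as the values
print(counts)

most_common = counts.index[0]  # results are sorted from most to least frequent
print(most_common)
```

The result is itself a Series, so a single count can be looked up by label, for example `counts["Ohio"]`.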

In [None]:
sanitized_df.sort_values(by='H_MEDIAN', inplace=True) # Sort by a column. Note each row keeps its original index label
sanitized_df

In [None]:
sanitized_df.sort_index() # Sort by index (returns a new DataFrame). Useful for restoring the original order after sorting by a certain column
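
The relationship between the two sorts is easier to see on a tiny table. This is a minimal sketch with made-up rows; only the column name `H_MEDIAN` is borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2
toy_df = pd.DataFrame({
    "occupation": ["A", "B", "C"],
    "H_MEDIAN": [30.0, 12.5, 21.0],
})

# Sorting by value reorders the rows, but each row keeps its index label
by_wage = toy_df.sort_values(by="H_MEDIAN")
print(by_wage.index.tolist())  # [1, 2, 0]

# Sorting by index puts the rows back in their original order
restored = by_wage.sort_index()
print(restored.equals(toy_df))  # True

# If the old labels are not needed, reset_index(drop=True) renumbers from 0
renumbered = by_wage.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Because `sort_index()` returns a new DataFrame by default, its result needs to be assigned to a variable (or called with `inplace=True`) to be kept.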

## df.groupby().how_to_group()
groupby is used when we want to combine rows based on one or more columns. We also need to specify how the other columns are combined! In the example below, we select the column we want to group everything by (AREA_TITLE) along with the numeric columns we want to combine. We then combine the rows that share an AREA_TITLE by averaging their values (i.e., taking the mean).

In [None]:
columns_to_focus_on=['AREA_TITLE', 'H_MEAN', 'A_MEAN', 'H_MEDIAN', 'A_MEDIAN']
grouped_df=sanitized_df[columns_to_focus_on].groupby(['AREA_TITLE']).mean()
grouped_df
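
The same grouping pattern can be checked by hand on a small table. This is a minimal sketch with made-up numbers; the result column names `H_MEAN_avg` and `A_MEAN_max` are hypothetical. The second call shows one way to use a different aggregation for each column, which is an extension of the example above rather than something the cell itself does.

```python
import pandas as pd

# Hypothetical toy data: two rows per area
toy_df = pd.DataFrame({
    "AREA_TITLE": ["Ohio", "Ohio", "Utah", "Utah"],
    "H_MEAN": [20.0, 30.0, 18.0, 26.0],
    "A_MEAN": [41600, 62400, 37440, 54080],
})

# One row per AREA_TITLE, every other column averaged
grouped = toy_df.groupby("AREA_TITLE").mean()
print(grouped)

# Named aggregation: choose a different function per column
mixed = toy_df.groupby("AREA_TITLE").agg(
    H_MEAN_avg=("H_MEAN", "mean"),
    A_MEAN_max=("A_MEAN", "max"),
)
print(mixed)
```

After grouping, AREA_TITLE becomes the index of the result, so a single group can be looked up with `grouped.loc["Ohio"]`.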