# Programming for Data Science Summary
## Chapter 04 - DataFrame Characteristics and Manipulation
This notebook is a cheat sheet for Chapter 04 of the Programming for Data Science course.

In [1]:
import pandas as pd
import numpy as np
print("> Pandas Version:",pd.__version__)
df = pd.read_csv("data.csv")

> Pandas Version: 2.2.2


### Data Attributes

In [2]:
# Data Attributes
df.columns # Returns the column labels
df.shape # Returns the shape as a (rows, columns) tuple
df.dtypes # Returns the data type of each column
df.index # Returns the row labels

display()

### Basic Inspection

In [14]:
# Basic Inspection
df.head(n=5) # Returns the first n rows
df.tail(n=5) # Returns the last n rows
df.info(buf=None, verbose=None) # Prints a summary (index, columns, dtypes, non-null counts, memory usage); pass buf to write to a stream instead
df.sort_values(
    by="Phone",
    ascending=True
) # Returns a new DataFrame sorted by the given column(s)
df.T # Returns the transposed DataFrame

display()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Index          100000 non-null  int64 
 1   User Id        100000 non-null  object
 2   First Name     100000 non-null  object
 3   Last Name      100000 non-null  object
 4   Sex            100000 non-null  object
 5   Email          100000 non-null  object
 6   Phone          100000 non-null  object
 7   Date of birth  100000 non-null  object
 8   Job Title      100000 non-null  object
dtypes: int64(1), object(8)
memory usage: 6.9+ MB
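
A small sketch of how `sort_values` behaves, using a hypothetical toy DataFrame (the values are made up, not taken from `data.csv`). Since `Phone` has dtype `object` in the output above, the toy column holds strings as well, so the sort is lexicographic rather than numeric:

```python
import pandas as pd

# Hypothetical toy data; "Phone" holds strings, like the object column above
toy = pd.DataFrame({"Phone": ["9", "10", "200"], "n": [1, 2, 3]})

# sort_values returns a new DataFrame and leaves the original untouched
sorted_toy = toy.sort_values(by="Phone", ascending=True)

print(sorted_toy)   # strings compare character by character: "10" < "200" < "9"
print(toy)          # original row order is preserved
print(toy.T.shape)  # transposing swaps rows and columns
```

The original row labels travel with the rows; chaining `.reset_index(drop=True)` renumbers them if needed.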


### Statistical Inspection

In [51]:
# Statistical Inspection
df.describe(include=np.number) # Returns a DataFrame of summary statistics for the columns of the selected dtypes (here numeric; use include="object" for object columns)
df.isna() # Returns a Boolean mask DataFrame: True iff the value is missing

# Aggregators
df_s = df.iloc[:100, :1]
df_s.mean()
df_s.median()
df_s.mode()
df_s.min()
df_s.max()
df_s.quantile(.5)
df_s.var()
df_s.std()
df_s.corr("spearman", numeric_only=True)
df_s.value_counts()
df_s.kurt()
df_s.skew()

df[["Index"]].agg(["mean", "kurt"]) # Like df.describe, but with the aggregators of your choice

display()
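
A minimal sketch of two common follow-ups to the calls above: counting missing values per column by summing the `isna()` mask, and picking aggregators with `agg`. The toy DataFrame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": ["a", None, "c", "d"],
})

# True counts as 1, so summing the mask gives the number of missing values per column
missing_per_column = toy.isna().sum()

# A list of aggregator names yields one row per aggregator; NaN is skipped by default
stats = toy[["x"]].agg(["mean", "max"])

print(missing_per_column)
print(stats)
```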

### Grouping

In [55]:
# Grouping
df.groupby("First Name")["Index"].agg("sum") # General syntax: df.groupby(key)[column].agg(aggregator)
# Returns a Series of aggregated values indexed by group key (a DataFrame when several columns or aggregators are used)
# Note: for different aggregators per column, call agg on the grouped DataFrame with a dictionary, like df.groupby(key).agg({col1: ["agg1", "agg2"], col2: ["agg3", "agg4"]})

display()
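
The dictionary form mentioned in the note can be sketched as follows. The toy DataFrame and the names `team`, `score` and `age` are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from data.csv
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1, 3, 2, 4, 6],
    "age": [20, 30, 40, 50, 60],
})

# One column, one aggregator: the result is a Series indexed by the group key
score_sum = toy.groupby("team")["score"].agg("sum")

# Dictionary form: called on the grouped DataFrame, it returns a DataFrame
# whose columns are (column, aggregator) pairs
multi = toy.groupby("team").agg({"score": ["sum", "mean"], "age": ["min", "max"]})

print(score_sum)
print(multi)
```

Individual cells of the second result are addressed with a tuple, for example `multi.loc["b", ("score", "mean")]`.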

### Crosstables and PivotTables

In [49]:
# Crosstables
pd.crosstab(
    index = df["Job Title"],
    columns = df["Sex"]
) # Returns a cross-tabulation: each cell counts the rows with that (Job Title, Sex) pair

# Pivot Tables
df.pivot_table(
    index = "Job Title",
    columns = "Sex",
    values = "Index",
    aggfunc = "std"
) # Returns a pivot table: each cell is the aggregate (here the standard deviation) of "Index" over the rows with that (Job Title, Sex) pair

display()
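
To make the difference concrete, here is a small sketch on a hypothetical toy DataFrame (`job`, `sex` and `salary` are made-up names): `crosstab` counts rows per pair, while `pivot_table` aggregates a value column per pair.

```python
import pandas as pd

# Hypothetical toy data; note there is no ("ops", "M") row
toy = pd.DataFrame({
    "job": ["dev", "dev", "dev", "ops", "ops"],
    "sex": ["F", "M", "M", "F", "F"],
    "salary": [10, 20, 40, 30, 50],
})

# Frequencies: pairs that never occur are reported as 0
counts = pd.crosstab(index=toy["job"], columns=toy["sex"])

# Aggregated values: pairs that never occur are reported as NaN
means = toy.pivot_table(index="job", columns="sex", values="salary", aggfunc="mean")

print(counts)
print(means)
```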

### Advanced Subset Grabbing

In [80]:
# Advanced Subset Grabbing
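df[["Index"]].isin([1]) # Returns a Boolean DataFrame: True where the value is in the list
df["Index"].between(1,5) # Returns a Boolean Series (bounds inclusive). Note: between is a Series method, so double brackets (a DataFrame) do not work
df.filter(axis=1, like="Name") # Returns the columns whose label contains "Name"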

display()
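
The Boolean results above are typically used as masks to select rows. A minimal sketch on a hypothetical toy DataFrame (column names mirror the dataset, values are made up):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6],
    "First Name": list("abcdef"),
    "Last Name": list("uvwxyz"),
})

# between is inclusive on both ends by default
in_range = toy[toy["Index"].between(2, 4)]

# isin keeps the rows whose value appears in the given list
picked = toy[toy["Index"].isin([1, 6])]

# filter matches on labels, not on values: keep columns whose name contains "Name"
names = toy.filter(like="Name", axis=1)

print(in_range)
print(picked)
print(names)
```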