### Pandas

In [None]:
# Returns a DataFrame
df[['Name', 'PID']]
# Rows where Name includes 'on'
df.loc[df['Name'].str.contains('on')]
# Rows where the first letter of Name is between A and L
df[df['Name'] < 'M']
# Convert a Series to integer dtype
ser.astype(int)


- NumPy enables **fast** computation involving arrays and matrices **(it is optimized for speed and memory consumption).**
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
    - homogeneous (all values are of the same type), which leads to **type coercion**
    - (potentially) multi-dimensional.
- Computation in `numpy` is fast because
    - Much of it is implemented in C.
    - `numpy` arrays are stored more efficiently in memory than Python lists. 
- Useful Series methods: `count()`, `unique()`, `nunique()`, `value_counts(normalize=...)` (returns a Series of counts of unique values, or proportions when `normalize=True`), `describe()`.
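
To make the type coercion point concrete, here is a minimal sketch with made-up values. It shows how `numpy` picks one common dtype for mixed inputs; the variable names are purely illustrative.

```python
import numpy as np

# Mixing ints and a float: every value is stored as a float
mixed = np.array([1, 2.5, 3])

# Mixing numbers and a string: every value is stored as a string
with_str = np.array([1, 'a', 3.0])

# All ints: stays an integer array (exact width is platform-dependent)
ints = np.array([1, 2, 3])

print(mixed.dtype)           # a floating-point dtype
print(with_str.dtype.kind)   # 'U' means unicode string
print(np.issubdtype(ints.dtype, np.integer))
```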

`pandas` is built on top of `numpy` **(it is slower and uses more memory than raw `numpy`, but is optimized for fast code development).**

- A Series in `pandas` is a `numpy` array with an index.
- A DataFrame is like a dictionary of columns, each of which is a `numpy` array.
- Many operations in `pandas` are fast because they use `numpy`'s implementations.
- To access the array underlying a DataFrame or Series, use the `to_numpy` method.
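    - ⚠️ Warning: `to_numpy` may return a view of the original object, not a copy! Modifying the array in place can then also change the original Series. Pass `copy=True` to force a copy.
    - `.values` is an older way to get the same array; the `pandas` documentation recommends `.to_numpy()` instead.
- Other useful DataFrame methods: `sort_values('name', ascending=False)`, `drop_duplicates(subset=['a', 'b'])`, `drop(columns=[])`.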
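
As a small illustration of the view-versus-copy warning, the sketch below (toy Series, illustrative names) asks for an explicit copy before modifying the array. Whether the default call returns a view depends on the dtype and the `pandas` version, so the example reports that with `np.shares_memory` rather than assuming it.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3])

# Default call: may or may not share memory with the Series
maybe_view = ser.to_numpy()
shares = np.shares_memory(maybe_view, ser.to_numpy())
print("default to_numpy() shares memory:", shares)

# Explicit copy: always safe to modify
safe = ser.to_numpy(copy=True)
safe[0] = 99

print(ser.iloc[0])  # the original Series is unchanged
print(safe[0])
```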

### Messy Data

In [None]:
# Strip the '$' and convert tuition to float
students['2021 tuition'].str.strip('$').astype(float)
# Map 'Y'/'N' flags to booleans
students['Paid'].replace({'Y': True, 'N': False})
# Convert to numbers; strings that cannot be parsed become NaN
pd.to_numeric(students['DSC 80 Final Grade'], errors='coerce')
# Reorder 'First Last' into 'Last, First'
parts = students['Student Name'].str.split()
students['name'] = parts.str[1] + ', ' + parts.str[0]

Outliers
* **Consistently "incorrect" values**. Example: Recorded ages of -1 or 99.
    - Solution: Change the value to the correct one if it is known!
* **Abnormal artifacts from the data collection process**.
    - Example: Spikes in recorded ages at round numbers (25, 30, 35, 40), or spikes in recorded COVID cases on Mondays.
    - Solution: Try "smoothing", e.g. binning the ages.
* **Unreasonable outliers**. Example: Age of 200.
    - Solution: Not sure. Could remove the row. Could be indicative of a bug in the data collection process. Could be real!
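
The three cases above can be sketched on a made-up Series of ages. The placeholder codes, the cutoff of 120, and the bin edges are all assumptions chosen for illustration, not rules from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical recorded ages, including placeholder codes and an outlier
ages = pd.Series([23, -1, 25, 99, 31, 30, 200, 47])

# Consistently "incorrect" values: the true value is unknown here,
# so treat the placeholder codes as missing
cleaned = ages.replace({-1: np.nan, 99: np.nan})

# Unreasonable outliers: flag them for inspection rather than silently dropping
implausible = cleaned > 120

# Artifacts at round numbers: "smooth" by binning the ages
binned = pd.cut(cleaned, bins=[0, 20, 40, 60, 120])

print(cleaned.isna().sum(), implausible.sum())
print(binned.value_counts().sort_index())
```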

In [None]:
# All of the rows where the subject age is missing
stops[stops['subject_age'].isna()]
# fill nan with column means
nans.agg(lambda x: x.fillna(x.mean()), axis=0)

* `.dropna()` drops **rows** containing **at least one** null value.
* `.dropna(how='all')` drops **rows** containing **only** null values.
* `.dropna(axis=1)` drops **columns** containing at least one null value.
* `.fillna(val)` fills null entries with the value `val`.
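* `.fillna(dict)` fills null entries using a dictionary `dict`, so that NaNs are filled differently for each column.
* `.fillna(method='bfill')` and `.fillna(method='ffill')` fill null entries using neighboring non-null entries. The `method` argument is deprecated as of `pandas` 2.1; `.bfill()` and `.ffill()` do the same thing.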

`NaN` is of type `float`!
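
A toy DataFrame makes the differences between these calls easier to see. This is a minimal sketch with made-up values; it uses `.ffill()` and `.bfill()` for the neighbor-based fills.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, np.nan, 9.0],
})

rows_any = df.dropna()            # keeps only rows with no NaN
rows_all = df.dropna(how='all')   # drops only the row that is entirely NaN
cols_any = df.dropna(axis=1)      # every column has a NaN, so none remain

filled_by_col = df.fillna({'a': -1, 'b': -2})  # 'c' is left untouched
filled_means = df.fillna(df.mean())            # column means as fill values
forward = df.ffill()                           # carry the previous value down
backward = df.bfill()                          # pull the next value up

print(rows_any.index.tolist(), rows_all.index.tolist(), cols_any.shape)
print(type(np.nan))
```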

### Hypothesis Testing (assess a model given a single random sample)

* If the two distributions are **categorical**, use the total variation distance (TVD) as the test statistic.
* If the two distributions are **numerical**, use the difference in group means or medians.

In [None]:
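# Numerical simulation: N rows, each with 114 draws that equal 1 with probability 0.55
temp = np.random.choice([1, 0], p=[0.55, 0.45], size=(N, 114))
# TVD between two categorical distributions stored as Series
np.sum(np.abs(series1 - series2)) / 2
# Categorical simulation: num_reps rows of simulated proportions, each from N draws
temp = np.random.multinomial(N, [0.1, 0.4, 0.5], size=num_reps) / N
# TVD between each simulated row and a single observed distribution
np.sum(np.abs(temp - eth['California'].to_numpy()), axis=1) / 2
# p-value: proportion of simulated statistics at least as large as the observed one
(np.array(results) >= obs).mean()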
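
Putting those pieces together, a full categorical hypothesis test might look like the sketch below. The null distribution, observed distribution, sample size, and the helper `tvd` are all hypothetical stand-ins, not values from the course data.

```python
import numpy as np

rng = np.random.default_rng(0)

def tvd(dist1, dist2):
    """Total variation distance; works row-wise on 2D input."""
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum(axis=-1) / 2

# Hypothetical model (null) and observed sample proportions
null_dist = np.array([0.1, 0.4, 0.5])
observed_dist = np.array([0.2, 0.35, 0.45])
sample_size = 200
num_reps = 10_000

observed_tvd = tvd(observed_dist, null_dist)

# Simulate num_reps samples under the null and convert counts to proportions
simulated = rng.multinomial(sample_size, null_dist, size=num_reps) / sample_size
simulated_tvds = tvd(simulated, null_dist)

# Proportion of simulated TVDs at least as large as the observed TVD
p_value = (simulated_tvds >= observed_tvd).mean()
print(observed_tvd, p_value)
```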

### Grouping

In [None]:
penguins.groupby('species')['body_mass_g'].aggregate(['count', 'mean'])
penguins.groupby('species').aggregate({'bill_length_mm': 'max', 'island': 'nunique'})
# IQR of a column
np.percentile(col, 75) - np.percentile(col, 25)
# transform returns a Series with the same length and index as the input
penguins.groupby('species')['body_mass_g'].transform(lambda ser: ser - ser.mean())
# filter keeps only the groups (species) for which the function returns True
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 39)
# pivot_table: one row for every unique value in index, one column for every unique value in columns
penguins.pivot_table(index='species', 
                     columns='island', 
                     values='body_mass_g', 
                     aggfunc='mean',
                     fill_value = 0) #aggfunc: count, mean, max, sum, size...
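
To compare what each of these returns, here is a minimal sketch on a five-row toy table. The column names and values are made up and only mimic the structure of `penguins`.

```python
import pandas as pd

toy = pd.DataFrame({
    'species': ['A', 'A', 'B', 'B', 'B'],
    'island': ['X', 'Y', 'X', 'X', 'Y'],
    'mass': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# aggregate: one row per group
agg = toy.groupby('species')['mass'].agg(['count', 'mean'])

# transform: same length as the input, here each value minus its group mean
centered = toy.groupby('species')['mass'].transform(lambda ser: ser - ser.mean())

# filter: original rows, but only from groups that pass the condition
kept = toy.groupby('species').filter(lambda df: df['mass'].mean() > 20)

# pivot_table: species as rows, islands as columns, mean mass in each cell
pivot = toy.pivot_table(index='species', columns='island',
                        values='mass', aggfunc='mean', fill_value=0)

print(agg, centered, kept, pivot, sep='\n\n')
```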

### Combining

### Permutation (compare two random samples)

In [None]:
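# Test statistic: difference in group means (last group minus first group)
observed_difference = (
    smoking_and_birthweight
    .groupby('Maternal Smoker')['Birth Weight']
    .mean()
    .diff()
    .iloc[-1]
)
# Setup (assumed; these definitions are not shown in the original notes)
to_shuffle = smoking_and_birthweight[['Maternal Smoker', 'Birth Weight']].copy()
weights = to_shuffle['Birth Weight'].to_numpy()
differences = []
for _ in range(n_repetitions): # standard loop-based approach
    # Step 1: Shuffle the weights
    shuffled_weights = np.random.permutation(weights)
    # Step 2: Put them in a DataFrame
    to_shuffle['Shuffled Birth Weight'] = shuffled_weights
    # Step 3: Compute the test statistic
    group_means = (
        to_shuffle
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled Birth Weight']
    )
    difference = group_means.diff().iloc[-1]
    # Step 4: Store the result
    differences.append(difference)
# p-value: proportion of simulated differences at least as large as the observed one
# (use <= instead if the alternative hypothesis predicts a negative difference)
pval = (np.array(differences) >= observed_difference).mean()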

In [None]:
# Faster approach: shuffle the group labels and vectorize with numpy
is_smoker = smoking_and_birthweight['Maternal Smoker'].values # boolean array
weights = smoking_and_birthweight['Birth Weight'].values # numeric array of birth weights
n_smokers = is_smoker.sum()
n_non_smokers = len(is_smoker) - n_smokers # the dataset has 1174 rows

is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T

mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
ultra_fast_differences = mean_smokers - mean_non_smokers
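
To see why the masking trick works, the sketch below checks on synthetic data that the vectorized computation matches a plain loop. The group sizes, weights, and number of permutations are made up; multiplying by a boolean mask zeroes out the other group, so the row sums divided by the group size are group means.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: 20 "smokers", 30 "non-smokers", random weights
is_smoker = np.array([True] * 20 + [False] * 30)
weights = rng.normal(120, 15, size=is_smoker.size)
n_smokers = is_smoker.sum()
n_non_smokers = is_smoker.size - n_smokers

# One shuffled copy of the labels per row: shape (200, 50)
perms = np.array([rng.permutation(is_smoker) for _ in range(200)])

# Vectorized: masked row sums divided by group sizes
vectorized = ((weights * perms).sum(axis=1) / n_smokers
              - (weights * ~perms).sum(axis=1) / n_non_smokers)

# Loop: compute the two group means directly for each permutation
looped = np.array([weights[p].mean() - weights[~p].mean() for p in perms])

print(np.allclose(vectorized, looped))
```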

### Missingness

MD (missing by design) - can I determine the missing value exactly by looking at other columns?  
NMAR (not missing at random) - is there a good reason why the missingness depends on the value itself?  
MAR (missing at random) - do other columns tell me the likelihood that a value is missing?  
MCAR (missing completely at random) - missingness depends on neither the value itself nor other columns. Can perform listwise deletion.

- Use permutation tests to assess whether a column is MAR or MCAR.
    - Create two groups: one where values in a column are missing, and another where values in a column aren't missing.
    - To test the missingness of column X:
        - For every other column, test the null hypothesis "the distribution of (other column) is the same when column X is missing and when column X is not missing."
        - If you fail to reject the null, then column X's missingness does not depend on (other column).
        - If you reject the null, then column X is MAR dependent on (other column).
        - **If you fail to reject the null for all other columns, then it is reasonable to treat column X as MCAR!**
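
The procedure above can be sketched for one pair of columns as follows. The data is synthetic and built so that `income` is missing more often for older people; the helper `missingness_perm_test` and the choice of absolute difference in means as the test statistic are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: 'income' goes missing more often when 'age' is above 50
df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n).astype(float),
    'income': rng.normal(50, 10, size=n),
})
missing_prob = np.where(df['age'] > 50, 0.6, 0.1)
df.loc[rng.random(n) < missing_prob, 'income'] = np.nan

def missingness_perm_test(data, missing_col, other_col, n_reps=1000):
    """Permutation test: does other_col differ when missing_col is missing?"""
    is_missing = data[missing_col].isna().to_numpy()
    values = data[other_col].to_numpy()

    def stat(mask):
        # Absolute difference in means between the two groups
        return abs(values[mask].mean() - values[~mask].mean())

    observed = stat(is_missing)
    # Shuffle the missing/not-missing labels to simulate the null
    sims = np.array([stat(rng.permutation(is_missing)) for _ in range(n_reps)])
    return observed, (sims >= observed).mean()

observed, p_value = missingness_perm_test(df, 'income', 'age')
print(observed, p_value)
```

A small p-value here suggests that the missingness of `income` depends on `age`, which is what MAR looks like; repeating the test against every other column covers the MCAR check.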