# 🧠 Python Library Reference for Public Health & Data Science

## 📊 Data Handling
| Library     | Purpose                                      | Notes                                  |
|-------------|----------------------------------------------|----------------------------------------|
| pandas      | Data manipulation, cleaning, reshaping       | Ideal for tabular health datasets      |
| NumPy       | Fast numerical operations, arrays            | Backbone for scientific computing      |

## 📈 Visualization
| Library     | Purpose                                      | Notes                                  |
|-------------|----------------------------------------------|----------------------------------------|
| matplotlib  | Customizable plotting                        | Low-level control                      |
| seaborn     | Statistical visualizations                   | Built on matplotlib, great defaults    |
| plotly      | Interactive charts                           | Web-ready, dashboard-friendly          |

## 🧪 Statistical Analysis
| Library       | Purpose                                    | Notes                                  |
|---------------|--------------------------------------------|----------------------------------------|
| SciPy         | Statistical tests, distributions           | Includes t-tests, ANOVA, etc.          |
| statsmodels   | Regression, GLMs, time series              | Epidemiological modeling               |
| pingouin      | Easy-to-use statistical tests              | Effect sizes, repeated measures        |

## ⏳ Survival Analysis
| Library     | Purpose                                      | Notes                                  |
|-------------|----------------------------------------------|----------------------------------------|
| lifelines   | Kaplan-Meier, Cox models                     | Visual survival curves                 |

## 🤖 Machine Learning
| Library       | Purpose                                    | Notes                                  |
|---------------|--------------------------------------------|----------------------------------------|
| scikit-learn  | ML algorithms, preprocessing               | Classification, regression, clustering |
| xgboost       | Gradient boosting                          | Fast; `scale_pos_weight` for imbalance |
| lightgbm      | Gradient boosting                          | Faster training, lower memory usage    |

## 🌍 Geospatial & Environmental
| Library     | Purpose                                      | Notes                                  |
|-------------|----------------------------------------------|----------------------------------------|
| geopandas   | Spatial data manipulation                    | Built on pandas                        |
| rasterio    | Raster data access                           | Environmental modeling                 |
| shapely     | Geometric operations                         | Buffering, intersections               |

## 🧬 Biomedical & Clinical
| Library     | Purpose                                      | Notes                                  |
|-------------|----------------------------------------------|----------------------------------------|
| PyHealth    | Deep learning for EHR                        | Clinical prediction tasks              |
| Biopython   | Genomic and biological sequence analysis     | DNA, protein, etc.                     |
| pydicom     | Medical imaging (DICOM format)               | Radiology and imaging workflows        |

## 🔗 Integration & APIs
| Library     | Purpose                                      | Notes                                  |
|-------------|----------------------------------------------|----------------------------------------|
| fhirclient  | HL7 FHIR API access                          | Clinical data interoperability         |


# Basic Data Analysis and Visualization in Python

This guide provides a walkthrough of common exploratory data analysis tasks using Python, primarily with the **pandas**, **seaborn**, and **matplotlib** libraries.

---

## 1. Setup and Placing Data into a DataFrame

The first step in any analysis is to import the necessary libraries and load your data into a pandas DataFrame. A DataFrame is a 2-dimensional labeled data structure, like a spreadsheet.

### Code
```python
import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample list of data
data = [25, 41, 33, 29, 45, 25, 38, 33, 40, 22, 51, 48, 35, 33, 42]

# Place the data into a pandas DataFrame
df = pd.DataFrame({'score': data})
print("DataFrame created:")
print(df.head())
```
Instead of typing or pasting in data as a list, you can load a .csv file. The simplest case is a file saved in the same folder as your Python script or notebook, which can be read by name alone. If the file is stored elsewhere, pass its relative or absolute path to `pd.read_csv()`. The rest of this guide assumes the file contains a numeric column named `score`.

### Code
```python
# Read the .csv file and create a DataFrame
df = pd.read_csv('sample_data.csv')
print("DataFrame created:")
print(df.head())
```
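When the file lives in another folder, building the path with `pathlib` keeps the call readable. The sketch below is only an illustration: it writes a small throwaway CSV to a temporary folder so that it runs anywhere, and the names `csv_path` and `df_demo` are made up for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a throwaway CSV so the example is runnable anywhere (illustration only)
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "sample_data.csv"
csv_path.write_text("score\n25\n41\n33\n")

# read_csv accepts a string or a Path, relative or absolute
df_demo = pd.read_csv(csv_path)

print(df_demo.shape)   # (rows, columns)
print(df_demo.dtypes)  # confirm that 'score' was parsed as a number
```

Checking `shape` and `dtypes` right after loading is a quick way to catch a wrong delimiter or a numeric column that was read as text.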

---

## 2. Sorting Data

You can easily sort the data in a DataFrame from the smallest to the largest value.

### Code
```python
# Sort the DataFrame by the 'score' column
df_sorted = df.sort_values(by='score')

print("Sorted DataFrame:")
print(df_sorted.head())
```
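Sorting in the other direction is a small variation. The following sketch, using a shortened copy of the sample scores, sorts from largest to smallest and renumbers the rows; `df_demo` and `df_desc` are names chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({'score': [25, 41, 33, 29, 45]})

# ascending=False sorts largest to smallest;
# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index
df_desc = df_demo.sort_values(by='score', ascending=False).reset_index(drop=True)

print(df_desc)
```

Note that `sort_values` returns a new DataFrame and leaves the original untouched.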

---

## 3. Frequency Distribution Tables

A frequency distribution shows how often different values occur in a dataset. We can create bins to group continuous data and calculate the frequency, relative frequency, and cumulative frequency.

### Code
```python
# 0. (Optional) Suggest a bin width using Sturges' rule. Alternatively, import my_stats_tools as mst.
def sturges_step(data):
    n = len(data)
    k = math.ceil(np.log2(n) + 1)  # suggested number of bins
    step = (np.max(data) - np.min(data)) / k  # data range divided by the bin count
    return step

step = sturges_step(df['score'])
print(f"Suggested step size: {step}")

# 1. Create bins for the data
bins = np.arange(20, 60, 5)  # Bin edges 20, 25, ..., 55 (width 5; the stop value 60 is excluded)

# 2. Group data into bins
df['binned'] = pd.cut(df['score'], bins=bins, right=False)

# 3. Calculate frequencies
freq = df['binned'].value_counts().sort_index()
rel_freq = df['binned'].value_counts(normalize=True).sort_index()
cum_freq = rel_freq.cumsum()  # Running total of the relative frequency (ends at 1.0)

# 4. Combine into a single table
dist_table = pd.DataFrame({
    'Frequency': freq,
    'Relative Frequency': rel_freq,
    'Cumulative Relative Frequency': cum_freq
})

print("Frequency Distribution Table:")
print(dist_table)
```
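As a worked example of Sturges' rule, the sketch below applies k = ceil(log2(n) + 1) to the 15 sample scores and derives evenly spaced bin edges from the result. The variable names are chosen for this example, and using `np.linspace` for the edges is one reasonable option, not the only one.

```python
import math

import numpy as np

scores = [25, 41, 33, 29, 45, 25, 38, 33, 40, 22, 51, 48, 35, 33, 42]

n = len(scores)
k = math.ceil(math.log2(n) + 1)           # suggested number of bins
width = (max(scores) - min(scores)) / k   # suggested bin width

# k bins need k + 1 evenly spaced edges from the minimum to the maximum
edges = np.linspace(min(scores), max(scores), k + 1)

print(f"n = {n}, bins = {k}, width = {width:.1f}")
print(edges)
```

For these scores the rule suggests 5 bins of width 5.8. Rounding to a convenient width such as 5, as done above, gives more readable class limits at the cost of a couple of extra bins.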

---

## 4. Measures of Central Tendency

Measures of central tendency describe the center of a dataset.

### Code
```python
# Calculate Mean
mean_val = df['score'].mean()

# Calculate Median
median_val = df['score'].median()

# Calculate Mode (mode() returns a Series because several values can tie)
mode_val = df['score'].mode().tolist()

print(f"Mean: {mean_val:.2f}")
print(f"Median: {median_val:.2f}")
print(f"Mode: {mode_val}")
```
You can also get the mean and median from the `.describe()` method, where the median appears as the `50%` row:
```python
print(df['score'].describe())
```
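Because a dataset can have more than one mode, it helps to see how `.mode()` behaves when values tie. A minimal sketch with made-up data (`bimodal` is a name chosen for this example):

```python
import pandas as pd

bimodal = pd.Series([1, 2, 2, 3, 4, 4, 5])

# mode() returns every value tied for the highest count, in sorted order
modes = bimodal.mode()

print(modes.tolist())   # both 2 and 4 occur twice
print(len(modes) > 1)   # True when the data has more than one mode
```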
---

## 5. Measures of Dispersion

Measures of dispersion (or variability) describe the spread of the data.

### Code
```python
# Calculate Range
range_val = df['score'].max() - df['score'].min()

# Calculate Variance
var_val = df['score'].var()

# Calculate Standard Deviation
std_val = df['score'].std()

# Calculate Coefficient of Variation (standard deviation relative to the mean)
CV_val = std_val / df['score'].mean()

# Calculate Interquartile Range (IQR)
# Note: quantile() interpolates linearly by default, so Q1 and Q3 may fall between observed values
q1 = df['score'].quantile(0.25)
q3 = df['score'].quantile(0.75)
iqr_val = q3 - q1

print(f"Range: {range_val}")
print(f"Variance: {var_val:.2f}")
print(f"Standard Deviation: {std_val:.2f}")
print(f"Coefficient of Variation: {CV_val:.2f}")
print(f"First Quartile Q1: {q1}")
print(f"Third Quartile Q3: {q3}")
print(f"Interquartile Range (IQR): {iqr_val}")
```
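One detail worth knowing: pandas computes the sample variance and standard deviation by default (dividing by n - 1), whereas NumPy defaults to the population version (dividing by n). The sketch below compares the two on the sample scores; the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

scores = pd.Series([25, 41, 33, 29, 45, 25, 38, 33, 40, 22, 51, 48, 35, 33, 42])

sample_var = scores.var()               # pandas default: ddof=1 (divide by n - 1)
population_var = scores.var(ddof=0)     # divide by n
numpy_var = np.var(scores.to_numpy())   # NumPy default: ddof=0

print(f"Sample variance:     {sample_var:.2f}")
print(f"Population variance: {population_var:.2f}")
print(f"NumPy default:       {numpy_var:.2f}")
```

Use the sample version when the data is a sample drawn from a larger population, which is the usual case in health studies.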

---

## 6. Skewness, Kurtosis, and Modality

These measures describe the shape of the data's distribution.

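* **Skewness**: Measures the asymmetry of the distribution. Positive values indicate a longer right tail, negative values a longer left tail.
* **Kurtosis**: Measures the "tailedness" of the distribution. pandas' `.kurt()` reports excess kurtosis, so a normal distribution scores about 0.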
* **Modality**: Describes the number of peaks in the distribution (unimodal, bimodal, etc.). This is observed visually from a histogram.

### Code
```python
# Calculate Skewness
skew_val = df['score'].skew()

# Calculate Kurtosis
kurt_val = df['score'].kurt()

print(f"Skewness: {skew_val:.2f}")
print(f"Kurtosis: {kurt_val:.2f}")
print("Modality: Observe the number of peaks in the histogram below.")
```
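To build intuition for the sign of the skewness, the following sketch compares three small made-up datasets. The data and variable names are invented for this example.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])       # one large value stretches the right tail
left_skewed = pd.Series([-10, -3, -2, -2, -1, -1])  # mirror image: long left tail

print(f"Symmetric:    {symmetric.skew():.2f}")
print(f"Right-skewed: {right_skewed.skew():.2f}")
print(f"Left-skewed:  {left_skewed.skew():.2f}")
```

The symmetric data gives a skewness of zero, the right-skewed data a positive value, and the left-skewed data a negative value of the same size.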

---

## 7. Data Visualization

Visualizing data is crucial for understanding its characteristics.

### Stem-and-Leaf Plot
Neither pandas nor seaborn provides a stem-and-leaf function, so we define our own. Alternatively, import my_stats_tools as mst.

```python
def create_stem_and_leaf(data_list, title="Stem-and-Leaf Display"):
    """Print a stem-and-leaf display for a list of non-negative integers."""
    print(title)
    print("-" * len(title))
    if not data_list:
        print("Data list is empty.")
        return
    stem_leaf = {}
    # Iterate over a sorted copy so the caller's list is not modified
    for num in sorted(data_list):
        stem, leaf = num // 10, num % 10  # stem = tens and above, leaf = ones digit
        stem_leaf.setdefault(stem, []).append(leaf)
    for stem, leaves in sorted(stem_leaf.items()):
        print(f" {stem} | {' '.join(map(str, leaves))}")

create_stem_and_leaf(df['score'].tolist())
```

### Histogram
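A histogram shows the frequency distribution as adjacent bars, one per bin. Setting `kde=True` overlays a smoothed density estimate. The `bins` array is the one defined in Section 3.
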
```python
sns.histplot(data=df, x='score', bins=bins, kde=True)
plt.title('Histogram of Scores')
plt.show()
```
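The bar heights can also be computed without plotting. The sketch below uses `np.histogram` with the same bin edges as above; the variable names are chosen for this example.

```python
import numpy as np

scores = [25, 41, 33, 29, 45, 25, 38, 33, 40, 22, 51, 48, 35, 33, 42]
bins = np.arange(20, 60, 5)  # edges 20, 25, ..., 55

# counts[i] is the number of scores in [edges[i], edges[i + 1]);
# only the last bin also includes its right edge
counts, edges = np.histogram(scores, bins=bins)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

The counts should match the `Frequency` column of the distribution table, which is a quick consistency check between the table and the plot.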

### Frequency Polygon
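A frequency polygon is a line graph connecting the midpoints of the tops of the histogram bars. Anchoring the line at zero, one bin width beyond each end, closes the shape.
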
```python
# Calculate midpoints and frequencies for the polygon (uses bins and dist_table from Section 3)
frequencies = dist_table['Frequency'].values
frequencies_anchored = np.concatenate(([0], frequencies, [0]))
bin_width = bins[1] - bins[0]
midpoints = (bins[:-1] + bins[1:]) / 2  # center of each bin
midpoints_anchored = np.concatenate(([midpoints[0] - bin_width], midpoints, [midpoints[-1] + bin_width]))

# Plot
sns.histplot(data=df, x='score', bins=bins, color='lightblue', alpha=0.5)
plt.plot(midpoints_anchored, frequencies_anchored, marker='o', color='red', label='Frequency Polygon')
plt.title('Frequency Polygon of Scores')
plt.legend()
plt.show()
```

### Box-and-Whisker Plot
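A box plot displays the five-number summary: minimum, Q1, median, Q3, and maximum. By default, seaborn draws the whiskers to the most extreme points within 1.5 × IQR of the box and plots anything beyond them as individual points, which makes outliers easy to spot.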

```python
sns.boxplot(data=df, y='score')
plt.title('Box-and-Whisker Plot of Scores')
plt.show()
```
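The 1.5 × IQR rule behind the whiskers can be applied directly to list the outliers. The helper below is a hypothetical sketch, not a pandas or seaborn function; it uses pandas' default linear interpolation for the quartiles, so borderline cases may differ slightly from other quartile conventions.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)].tolist()

scores = [25, 41, 33, 29, 45, 25, 38, 33, 40, 22, 51, 48, 35, 33, 42]

print(iqr_outliers(scores))         # no score falls outside the fences
print(iqr_outliers(scores + [95]))  # the added extreme value is flagged
```

For the sample scores the fences are 15.25 and 57.25, so nothing is flagged. Appending a score of 95 moves the upper fence to 58.875, and 95 is reported as an outlier.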