## Useful Built-in Pandas Functions for Exploratory Data Analysis

### Basic DataFrame Information
- `.shape`: Returns a tuple of (rows, columns) showing the dimensions of your DataFrame
- `.columns`: Returns an `Index` object holding the column labels (use `.columns.tolist()` for a plain list)
- `.dtypes`: Shows the data type of each column (e.g., int64, float64, object)
- `.info()`: Prints a concise summary including column names, non-null counts, dtypes, and memory usage
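
As a minimal sketch of the attributes above, the snippet below builds a tiny DataFrame and inspects it. The values and the `population` column are invented purely for illustration and are not taken from the Glasgow dataset.

```python
import pandas as pd

# Toy data, invented for illustration only (not the real dataset)
df = pd.DataFrame({
    "areaName": ["A", "B", "C"],
    "childPoverty": [12.5, 30.1, 21.0],
    "population": [1000, 2500, 1800],
})

n_rows, n_cols = df.shape            # tuple of (rows, columns)
column_names = df.columns.tolist()   # Index converted to a plain list
column_types = df.dtypes             # Series mapping column name -> dtype
df.info()                            # prints the summary and returns None
```

Note that `.shape`, `.columns`, and `.dtypes` are attributes (no parentheses), while `.info()` is a method.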

### Data Inspection
- `.head(n=5)`: Displays the first n rows (default 5) of the DataFrame
- `.tail(n=5)`: Displays the last n rows (default 5) of the DataFrame
- `.sample(n=5)`: Returns a random sample of n rows (default 1); pass `random_state` to make the sample reproducible
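
The three inspection methods can be compared side by side on a small frame. This is only a sketch: the eight rows below are made up, and `random_state=42` is an arbitrary seed chosen so the sample is repeatable.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({"areaName": list("ABCDEFGH"), "childPoverty": range(8)})

first_three = df.head(3)                          # rows 0, 1, 2
last_two = df.tail(2)                             # rows 6, 7
random_three = df.sample(n=3, random_state=42)    # 3 distinct rows, repeatable
```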

### Summary Statistics
- `.describe()`: Returns a table of key summary statistics for numerical columns (count, mean, std, min, quartiles, max)
- `.describe(include='all')`: Includes statistics for all column types, not just numerical
- `.value_counts()`: For a Series, shows each unique value and its frequency, most frequent first (missing values are excluded by default)
- `.nunique()`: Returns the number of unique non-missing values in each column
- `.unique()`: Returns an array of the unique values in order of appearance (Series only)
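
A short sketch of the summary methods above, assuming a toy frame with one text column and one numeric column. The `region` column and all values are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "childPoverty": [10.0, 20.0, 30.0, 40.0],
})

numeric_summary = df.describe()               # numeric columns only
full_summary = df.describe(include="all")     # adds unique/top/freq for text columns
region_counts = df["region"].value_counts()   # frequency of each region
unique_per_column = df.nunique()              # distinct values per column
unique_regions = df["region"].unique()        # distinct values, order of appearance
```

In `full_summary`, statistics that do not apply to a column type (for example the mean of a text column) appear as `NaN`.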

### Missing Data Analysis
- `.isnull().sum()`: Shows the count of missing values per column
- `.isna().sum()`: Equivalent to `.isnull().sum()` (`isnull` is an alias of `isna`)
- `.dropna()`: Returns a copy with every row that contains at least one missing value removed
- `.fillna(value)`: Returns a copy with missing values replaced by the specified value
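
The sketch below shows counting, dropping, and filling missing values on a toy frame. The values are invented, and filling with the column mean is just one possible choice, not a recommendation for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data with gaps, invented for illustration only
df = pd.DataFrame({
    "areaName": ["A", "B", "C"],
    "childPoverty": [12.5, np.nan, 21.0],
    "maleLE": [75.0, 71.2, np.nan],
})

missing_per_column = df.isna().sum()   # count of NaN per column
complete_rows = df.dropna()            # keeps only rows with no NaN at all

# Fill one column with its own mean; other columns are left untouched
filled = df.fillna({"childPoverty": df["childPoverty"].mean()})
```

Both `dropna()` and `fillna()` return new objects here, so `df` itself still contains its missing values.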

### Data Selection & Filtering
- `.loc[rows, columns]`: Label-based selection for specific rows and columns
- `.iloc[rows, columns]`: Integer position-based selection
- `.query('condition')`: Filter rows using a string expression (e.g., `data.query('maleLE > 75')`)
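
To make the difference between label-based and position-based selection concrete, here is a small sketch. The index labels and values are hypothetical; only the `maleLE` column name comes from the example above.

```python
import pandas as pd

# Toy data with string row labels, invented for illustration only
df = pd.DataFrame(
    {"maleLE": [72.0, 76.5, 78.1], "childPoverty": [30.0, 15.0, 9.0]},
    index=["A", "B", "C"],
)

by_label = df.loc["B", "maleLE"]        # row labelled "B", column "maleLE"
by_position = df.iloc[0, 1]             # first row, second column
label_slice = df.loc["A":"B"]           # label slices include both endpoints
position_slice = df.iloc[0:1]           # position slices exclude the end
long_lived = df.query("maleLE > 75")    # same as df[df["maleLE"] > 75]
```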

### Sorting & Ranking
- `.sort_values('column')`: Returns the DataFrame sorted by the specified column(s), ascending by default
- `.sort_values('column', ascending=False)`: Sorts in descending order
- `.nlargest(n, 'column')`: Returns the n rows with the largest values in the specified column
- `.nsmallest(n, 'column')`: Returns the n rows with the smallest values in the specified column
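
A brief sketch of the sorting helpers, using four invented rows. `nlargest` and `nsmallest` give the same rows as sorting and then taking `head(n)`, but state the intent more directly.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "areaName": ["A", "B", "C", "D"],
    "childPoverty": [12.0, 30.0, 21.0, 5.0],
})

ascending = df.sort_values("childPoverty")
descending = df.sort_values("childPoverty", ascending=False)
top_two = df.nlargest(2, "childPoverty")       # highest two, largest first
bottom_two = df.nsmallest(2, "childPoverty")   # lowest two, smallest first
```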

### Grouping & Aggregation
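- `.groupby('column')`: Groups rows by the unique values in the specified column(s); nothing is computed until an aggregation is applied
- `.groupby('column').mean(numeric_only=True)`: Calculates the mean of each numeric column per group (in pandas 2.0 and later, omitting `numeric_only=True` raises an error if any remaining column is non-numeric)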
- `.groupby('column').agg({'col1': 'mean', 'col2': 'sum'})`: Apply different aggregations to different columns
- `.pivot_table()`: Create a spreadsheet-style pivot table for data summarization
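
The grouping methods above might be used as follows. This is a sketch on invented data; the `region` and `population` columns are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "childPoverty": [10.0, 20.0, 30.0, 50.0],
    "population": [100, 200, 300, 400],
})

# Mean of a single column per group
group_means = df.groupby("region")["childPoverty"].mean()

# A different aggregation for each column
mixed = df.groupby("region").agg({"childPoverty": "mean", "population": "sum"})

# The same per-group mean expressed as a pivot table
pivot = df.pivot_table(values="childPoverty", index="region", aggfunc="mean")
```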

### Correlation & Relationships
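- `.corr()`: Computes the pairwise Pearson correlation between columns; pass `numeric_only=True` if the DataFrame also contains non-numeric columns
- `.corrwith(other)`: Computes the correlation of each column with another Series or DataFrame
- `.cov()`: Computes the pairwise covariance between columns; pass `numeric_only=True` if the DataFrame also contains non-numeric columns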
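
A minimal sketch of the correlation methods, assuming pandas 1.5 or later for the `numeric_only` argument. The columns `x`, `y`, and `z` are invented so that the expected relationships are obvious: `y` rises with `x` and `z` falls with it.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "areaName": list("ABCD"),
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * x
    "z": [4.0, 3.0, 2.0, 1.0],   # exactly 5 - x
})

corr_matrix = df.corr(numeric_only=True)        # skips the text column
cov_matrix = df.cov(numeric_only=True)
corr_with_x = df[["y", "z"]].corrwith(df["x"])  # one value per column
```

Selecting the numeric columns first, as the `corrwith` line does, is an alternative to `numeric_only=True`.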




---


```python
# Quick examples to try with the Glasgow dataset (run each in a separate code cell)
import pandas as pd
data_url = 'https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/charts/extreme/glasgowHealthData.csv'
data = pd.read_csv(data_url)

# Check dataset dimensions
print(f"Dataset has {data.shape[0]} rows and {data.shape[1]} columns")

# Quick overview of the data
data.info()

# Summary statistics for all numeric columns
data.describe()

# Check for missing values
data.isnull().sum()

# Find areas with highest child poverty
data.nlargest(10, 'childPoverty')[['areaName', 'childPoverty']]

# Correlation between different deprivation measures
data[['incomeDeprevation', 'employmentDeprivation', 'childPoverty']].corr()
```