# TechFlow Data Analysis - Module 1
## Pandas: Series & DataFrames

**Your Role:** Data Analyst at TechFlow (B2B SaaS Company)

**Your Mission:** Master the building blocks of Pandas - Series and DataFrames.

**Why this matters:**
- Pandas is THE library for data analysis in Python
- Series = single column of data (like a list with an index)
- DataFrame = table of data (like a spreadsheet or SQL table)
- Everything in data analysis starts with loading and exploring data

**This module covers:**
- Loading data from files (CSV, TSV, various formats)
- Understanding Series and DataFrames
- Exploring data structure and types
- Selecting columns and rows
- Basic data inspection
- Handling different data sources

**Dataset files in this module:**
- `TechFlow.csv` - Full 50-customer dataset
- `customers_small.csv` - Simple 10-customer dataset
- `monthly_revenue.csv` - Time series revenue data
- `support_tickets.tsv` - Tab-separated ticket data
- `customers_messy.csv` - Messy data for cleaning
- `nps_surveys.csv` - Survey response data

**Time to complete:** ~60 minutes

---

# SETUP: Import Libraries

Every Pandas script starts with importing the library.

In [None]:
# Standard imports - run this cell first!
import pandas as pd
import numpy as np

# Set display options for better output
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 200)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print("Ready to go!")

---
# PART 1: Loading Data from Files

Data lives in files. Let's learn to load various formats.

## 1.1 Loading CSV Files

**CSV (Comma-Separated Values)** is one of the most common formats for tabular data.

**Load a CSV file**

```python
# Load the small customer dataset
df = pd.read_csv('../dataset/customers_small.csv')

# Display it
df
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**What just happened?**
- `pd.read_csv()` reads the CSV file
- Returns a **DataFrame** (table of data)
- Column headers come from the first row
- Index (row numbers) added automatically

**Terminology:**
- Each **row** is a record (customer)
- Each **column** is a field/variable (attribute)
- The **index** is on the left (0, 1, 2, ...)
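
To see these pieces without relying on the dataset files, the sketch below reads a tiny inline CSV. The rows are invented for illustration (they are not taken from `customers_small.csv`), and `io.StringIO` simply lets `pd.read_csv()` treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the values are invented for this illustration
csv_text = """CustomerID,CompanyName,MonthlyRevenue
1001,Acme,500
1002,Globex,150
"""

demo = pd.read_csv(io.StringIO(csv_text))

print(demo.columns.tolist())  # column headers come from the first line
print(demo.index.tolist())    # automatic index: one integer per row
print(demo.shape)             # (rows, columns)
```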

**Load the full TechFlow dataset**

```python
# Load full dataset (50 customers, 32 columns)
df_full = pd.read_csv('../dataset/TechFlow.csv')

print(f"Shape: {df_full.shape}")
print(f"That's {df_full.shape[0]} customers with {df_full.shape[1]} attributes each")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 1.2 read_csv() Options

Control how data is loaded with parameters.

**Select specific columns**

```python
# Load only the columns we need
df = pd.read_csv(
    '../dataset/TechFlow.csv',
    usecols=['CustomerID', 'CompanyName', 'Industry', 'MonthlyRevenue']
)

df.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Set a column as the index**

```python
# Use CustomerID as row labels
df = pd.read_csv(
    '../dataset/customers_small.csv',
    index_col='CustomerID'
)

df
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Read only first N rows**

Useful for previewing large files.

```python
# Load only first 5 rows
df = pd.read_csv('../dataset/TechFlow.csv', nrows=5)

df
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 1.3 Loading TSV and Other Delimiters

Not all files use commas as separators.

**Load tab-separated file**

```python
# TSV = Tab-Separated Values
tickets = pd.read_csv('../dataset/support_tickets.tsv', sep='\t')

tickets
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Common separators:**
| Separator | Character | Use Case |
|-----------|-----------|----------|
| Comma | `,` | Default CSV |
| Tab | `\t` | TSV files |
| Semicolon | `;` | European CSVs |
| Pipe | `\|` | Some exports |

---
# PART 2: Understanding DataFrames

A DataFrame is a 2D table with labeled rows and columns.

**Load our working dataset**

```python
# Load small dataset for learning
df = pd.read_csv('../dataset/customers_small.csv')

# Display it
df
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 2.1 Basic Inspection

**head() - View first rows**

```python
# First 5 rows (default)
df.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**tail() - View last rows**

```python
# Last 3 rows
df.tail(3)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**shape - Get dimensions**

```python
# (rows, columns)
print(f"Shape: {df.shape}")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**columns - Get column names**

```python
# List all columns
print(df.columns)

# As a regular Python list
print(list(df.columns))
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**index - Get row labels**

```python
# View the index
print(df.index)

# As a list
print(list(df.index))
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 2.2 Data Types and Info

**dtypes - Column data types**

```python
# See data type of each column
df.dtypes
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Common Pandas dtypes:**
| dtype | Meaning | Examples |
|-------|---------|----------|
| `int64` | Integer numbers | 1, 42, -5 |
| `float64` | Decimal numbers | 3.14, -0.5 |
| `object` | Text/strings (or mixed types) | "TechFlow", "Basic" |
| `bool` | True/False | True, False |
| `datetime64` | Dates/times | 2024-01-15 |
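
A small sketch of how these dtypes come out of `pd.read_csv()`, using invented rows (the column names below are hypothetical). Dates are read as text unless you ask for parsing, here with `parse_dates`. The exact labels printed for the text and date columns can vary between pandas versions.

```python
import io
import pandas as pd

# Hypothetical rows; column names and values are invented
csv_text = """CustomerID,Score,Plan,Cancelled,SignupDate
1,8.5,Basic,False,2024-01-15
2,9.0,Pro,True,2024-02-01
"""

# Without parse_dates, SignupDate would be read as text
typed = pd.read_csv(io.StringIO(csv_text), parse_dates=['SignupDate'])

print(typed.dtypes)
```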

**info() - Comprehensive summary**

```python
# Complete overview of the DataFrame
df.info()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**What info() tells you:**
- Total number of rows
- Column names and their data types
- Non-null count (to spot missing values)
- Memory usage

**describe() - Statistical summary**

```python
# Statistics for numeric columns
df.describe()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**describe() provides:**
- **count**: Non-null values
- **mean**: Average
- **std**: Standard deviation
- **min/max**: Range
- **25%, 50%, 75%**: Quartiles (50% = median)
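
As a quick check of the last point, the sketch below runs `describe()` on a made-up Series (the figures are invented) and compares the `50%` row with `median()`.

```python
import pandas as pd

# Made-up revenue figures for illustration
revenue_demo = pd.Series([50, 150, 150, 500, 600], name='MonthlyRevenue')

summary = revenue_demo.describe()
print(summary)

# The 50% row is the median
print(summary['50%'] == revenue_demo.median())
```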

---
# PART 3: Understanding Series

A **Series** is a single column of data - like a labeled list.

## 3.1 Selecting a Single Column

**Select column with bracket notation**

```python
# Get MonthlyRevenue column as a Series
revenue = df['MonthlyRevenue']

print(type(revenue))
revenue
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select column with dot notation**

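Works only for column names that have no spaces or special characters and that do not clash with an existing DataFrame attribute or method (such as `count` or `shape`).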

```python
# Same result, different syntax
revenue = df.MonthlyRevenue

revenue
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**When to use which:**
- `df['column']` - Always works; required for column names with spaces or special characters
- `df.column` - Cleaner, but only for simple column names that don't match a DataFrame attribute or method
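
A short sketch of where dot notation falls down, using a hypothetical frame with two awkward column names: one contains a space, and one (`count`) shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one column name has a space, one clashes with a method
demo = pd.DataFrame({
    'Monthly Revenue': [500, 50],
    'count': [3, 7],
})

# Bracket notation handles both names
print(demo['Monthly Revenue'].tolist())
print(demo['count'].tolist())

# Dot notation resolves to the DataFrame.count method, not the column
print(callable(demo.count))
```

`demo.Monthly Revenue` is not valid Python at all, so bracket notation is the only option for that column.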

## 3.2 Series Properties

**Series attributes**

```python
revenue = df['MonthlyRevenue']

print(f"Name: {revenue.name}")
print(f"Type: {revenue.dtype}")
print(f"Length: {len(revenue)}")
print(f"Values: {revenue.values}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Series vs List vs NumPy array**

```python
revenue = df['MonthlyRevenue']

# Series - labeled, powerful
print(f"Series: {type(revenue)}")

# Values as NumPy array
print(f"Array: {type(revenue.values)}")

# Convert to list
print(f"List: {type(revenue.tolist())}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 3.3 Series Operations

**Basic statistics**

```python
revenue = df['MonthlyRevenue']

print(f"Sum: ${revenue.sum():,}")
print(f"Mean: ${revenue.mean():.2f}")
print(f"Median: ${revenue.median():.2f}")
print(f"Min: ${revenue.min():,}")
print(f"Max: ${revenue.max():,}")
print(f"Std Dev: ${revenue.std():.2f}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Value counts - Frequency table**

```python
# How many customers per plan?
df['SubscriptionPlan'].value_counts()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Unique values**

```python
# What industries do we have?
print(f"Unique industries: {df['Industry'].nunique()}")
print(f"List: {df['Industry'].unique()}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 4: Selecting Multiple Columns

Select a subset of columns to create a new DataFrame.

**Select multiple columns with a list**

```python
# Select specific columns
df_subset = df[['CompanyName', 'Industry', 'MonthlyRevenue']]

df_subset
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Note the double brackets!**
- `df['column']` → Series (single column)
- `df[['column']]` → DataFrame (1+ columns)
- `df[['col1', 'col2']]` → DataFrame with multiple columns
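
The difference is easy to confirm by checking the type and shape of each result. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({'CompanyName': ['Acme', 'Globex'], 'MonthlyRevenue': [500, 150]})

single = demo['MonthlyRevenue']      # single brackets: Series
as_frame = demo[['MonthlyRevenue']]  # double brackets: one-column DataFrame

print(type(single).__name__, single.shape)
print(type(as_frame).__name__, as_frame.shape)
```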

**Reorder columns**

```python
# Put columns in a specific order
df_reordered = df[['CompanyName', 'MonthlyRevenue', 'Industry', 'SubscriptionPlan']]

df_reordered.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Exclude columns**

```python
# Drop specific columns
df_without = df.drop(columns=['CustomerID', 'Cancelled'])

df_without.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 5: Selecting Rows

Pandas offers several ways to select specific rows.

## 5.1 Slicing Rows

**Select rows by position (slicing)**

```python
# First 3 rows
df[:3]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Rows from the middle**

```python
# Rows 3, 4, 5 (indices 3:6)
df[3:6]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 5.2 iloc - Integer Location

Select by row/column **position** (0-based integers).

**Select single row**

```python
# First row (position 0)
df.iloc[0]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select multiple rows**

```python
# Rows 0, 2, 4
df.iloc[[0, 2, 4]]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select rows and columns**

```python
# Rows 0-2, columns 1-3
df.iloc[0:3, 1:4]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select single value**

```python
# Value at row 0, column 1
df.iloc[0, 1]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 5.3 loc - Label Location

Select by row/column **labels** (names).

**Setup: Set CustomerID as index**

```python
# Make CustomerID the row label
df_indexed = df.set_index('CustomerID')

df_indexed.head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select row by label**

```python
# Get customer 1001
df_indexed.loc[1001]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select rows and specific columns**

```python
# Customer 1001, specific columns
df_indexed.loc[1001, ['CompanyName', 'Industry', 'MonthlyRevenue']]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Select multiple rows by label**

```python
# Customers 1001, 1003, 1005
df_indexed.loc[[1001, 1003, 1005]]
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**iloc vs loc Summary:**

| Method | Selection by | Slice includes endpoint? |
|--------|-------------|-------------------|
| `iloc` | Position (0, 1, 2...) | No |
| `loc` | Label (names) | Yes |
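
The endpoint rule applies to slices. A minimal sketch with a made-up frame whose labels differ from its positions:

```python
import pandas as pd

# Made-up frame with labels that differ from positions
demo = pd.DataFrame(
    {'MonthlyRevenue': [500, 50, 150, 600]},
    index=[1001, 1002, 1003, 1004],
)

by_position = demo.iloc[0:2]     # positions 0 and 1; position 2 is excluded
by_label = demo.loc[1001:1003]   # labels 1001, 1002, 1003; endpoint included

print(by_position.index.tolist())
print(by_label.index.tolist())
```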

---
# PART 6: Creating DataFrames

Create DataFrames from scratch.

**From a dictionary**

```python
# Create from dictionary (columns as keys)
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
}

df_new = pd.DataFrame(data)
df_new
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**From list of dictionaries**

```python
# Each dict is a row
customers = [
    {'name': 'TechFlow', 'revenue': 500, 'plan': 'Enterprise'},
    {'name': 'MediCare', 'revenue': 50, 'plan': 'Basic'},
    {'name': 'EduLearn', 'revenue': 150, 'plan': 'Standard'}
]

df_from_list = pd.DataFrame(customers)
df_from_list
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Create a Series**

```python
# From a list
revenues = pd.Series([500, 50, 150, 600, 150], name='Revenue')
print(revenues)

# With custom index
revenues_labeled = pd.Series(
    [500, 50, 150],
    index=['TechFlow', 'MediCare', 'EduLearn'],
    name='Revenue'
)
print("\nWith labels:")
print(revenues_labeled)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 7: Working with Multiple Files

Real analysis often involves multiple data sources.

**Load all our datasets**

```python
# Load various files
customers = pd.read_csv('../dataset/customers_small.csv')
monthly = pd.read_csv('../dataset/monthly_revenue.csv')
tickets = pd.read_csv('../dataset/support_tickets.tsv', sep='\t')
surveys = pd.read_csv('../dataset/nps_surveys.csv')

print("Loaded datasets:")
print(f"  customers: {customers.shape}")
print(f"  monthly: {monthly.shape}")
print(f"  tickets: {tickets.shape}")
print(f"  surveys: {surveys.shape}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Explore the monthly revenue data**

```python
# Time series format - columns are months
monthly
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Explore the support tickets**

```python
# Notice missing values in some columns
tickets
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Explore messy data**

```python
# This data has issues we'll need to clean
messy = pd.read_csv('../dataset/customers_messy.csv')
messy
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Check messy data info**

```python
# Notice data types and missing values
messy.info()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 8: Saving Data

Write DataFrames back to files.

**Save to CSV**

```python
# Create a subset
df_subset = df[['CompanyName', 'Industry', 'MonthlyRevenue']]

# Save to CSV
df_subset.to_csv('../dataset/customers_subset.csv', index=False)

print("Saved customers_subset.csv")

# Verify by reading it back
pd.read_csv('../dataset/customers_subset.csv')
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Note:** `index=False` prevents adding an extra index column.
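
To see the extra column this refers to, the sketch below writes a made-up frame both ways and reads each version back. It relies on `to_csv()` returning the CSV text as a string when no path is given, so nothing is written to disk.

```python
import io
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({'CompanyName': ['Acme', 'Globex'], 'MonthlyRevenue': [500, 150]})

# to_csv() returns the CSV text when no path is given
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

# Reading the first version back yields an extra 'Unnamed: 0' column
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```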

---
# PRACTICE: Business Scenarios

Apply what you've learned!

### Q1: Load and inspect the full TechFlow dataset

Load TechFlow.csv and show: shape, columns, and first 3 rows.

In [None]:
# Your answer:


### Q2: Get revenue statistics

Calculate mean, median, min, and max of MonthlyRevenue.

In [None]:
# Your answer:


### Q3: Count customers by plan

How many customers are on each SubscriptionPlan?

In [None]:
# Your answer:


### Q4: Select key columns

Create a DataFrame with only: CompanyName, Industry, MonthlyRevenue, NPS_Score

In [None]:
# Your answer:


### Q5: Use iloc to get first 3 customers, first 4 columns

Select rows 0-2 and columns 0-3 using iloc.

In [None]:
# Your answer:


### Q6: List all unique industries

How many unique industries are there? List them.

In [None]:
# Your answer:


### Q7: Create a customer DataFrame from scratch

Create a DataFrame with 3 customers: name, revenue, and industry.

In [None]:
# Your answer:


---
# CHEAT SHEET

## Loading Data
```python
# CSV
df = pd.read_csv('file.csv')
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])
df = pd.read_csv('file.csv', index_col='ID')
df = pd.read_csv('file.csv', nrows=100)

# TSV / Other delimiters
df = pd.read_csv('file.tsv', sep='\t')
```

## Saving Data
```python
df.to_csv('output.csv', index=False)
```

## Inspection
```python
df.head(n)      # First n rows
df.tail(n)      # Last n rows
df.shape        # (rows, cols)
df.columns      # Column names
df.index        # Row labels
df.dtypes       # Data types
df.info()       # Full summary
df.describe()   # Stats
```

## Selecting Columns
```python
df['column']              # Series
df[['col1', 'col2']]      # DataFrame
df.drop(columns=['col'])  # Exclude
```

## Selecting Rows
```python
df[0:5]                   # First 5 rows
df.iloc[0]                # Row by position
df.iloc[[0, 2, 4]]        # Multiple rows
df.iloc[0:3, 0:4]         # Rows & cols
df.loc[label]             # Row by label
df.loc[label, 'col']      # Specific cell
```

## Series Operations
```python
s.sum()          # Total
s.mean()         # Average
s.median()       # Middle value
s.min()          # Smallest value
s.max()          # Largest value
s.std()          # Standard deviation
s.value_counts() # Frequency table
s.unique()       # Unique values
s.nunique()      # Count unique
```

## Creating Data
```python
# DataFrame from dict
pd.DataFrame({'col': [1, 2, 3]})

# Series
pd.Series([1, 2, 3], name='values')
```

---
## Module 1 Complete! 🎉

**You now know how to:**
- ✅ Load CSV and TSV files with pd.read_csv()
- ✅ Understand DataFrame structure (rows, columns, index)
- ✅ Inspect data with head(), tail(), shape, info(), describe()
- ✅ Work with Series (single columns)
- ✅ Calculate basic statistics (sum, mean, value_counts)
- ✅ Select columns (single and multiple)
- ✅ Select rows with iloc (position) and loc (label)
- ✅ Create DataFrames from scratch
- ✅ Save data back to CSV

**Key Takeaways:**
1. DataFrame = table, Series = column
2. Use `df['col']` for one column, `df[['col1','col2']]` for multiple
3. `iloc` = integer position, `loc` = label
4. Always inspect data first: shape, info(), head()
5. `value_counts()` is your friend for categorical data

**Next: Module 2 - Indexing & GroupBy**