# Working with Data in Python

This notebook will help you get started with data analysis using Python and pandas. If you've used Excel before, you'll find many familiar concepts—plus some new superpowers!

## 1. Introduction

**pandas** is a Python library for working with tabular data (like spreadsheets). It's widely used in research for:
- Cleaning and organizing survey results
- Combining data from different sources
- Calculating quick statistics
- Exporting results for reports

## 2. Getting Started with pandas

Let's import pandas and create a small example dataset. (In practice, you'd load your own CSV file.)

In [None]:
import pandas as pd

# Example data: research projects
data = {
    'project': ['Green Roof', 'Solar Study', 'Urban Heat', 'Daylight', 'Retrofit'],
    'year': [2021, 2022, 2022, 2023, 2021],
    'type': ['Sustainability', 'Energy', 'Climate', 'Lighting', 'Renovation'],
    'lead': ['Smith', 'Lee', 'Patel', 'Kim', None],
    'budget': [50000, 60000, None, 45000, 30000]
}
df = pd.DataFrame(data)

# Save to CSV for demonstration
df.to_csv('projects.csv', index=False)

# Read the CSV file (simulating a real workflow)
df = pd.read_csv('projects.csv')

df.head()  # Show the first few rows

In [None]:
# Get a summary of the dataframe
df.info()

In [None]:
# Quick statistics for numeric columns
df.describe()

## 3. Exploring the Data

Let's look at how to select columns, filter rows, and sort data.

In [None]:
# Select a single column
df['project']

In [None]:
# Select multiple columns
df[['project', 'year']]

In [None]:
# Filter rows: projects from 2022
df[df['year'] == 2022]
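Filters can also combine several conditions. The sketch below uses a small stand-in table (the `projects` name and its values are made up for illustration, mirroring the example data) and shows `&` for "and" plus `isin()` for matching a list of values. Each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's project table
projects = pd.DataFrame({
    'project': ['Green Roof', 'Solar Study', 'Urban Heat', 'Daylight', 'Retrofit'],
    'year': [2021, 2022, 2022, 2023, 2021],
    'budget': [50000, 60000, None, 45000, 30000],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
recent_and_funded = projects[(projects['year'] >= 2022) & (projects['budget'] > 40000)]

# isin() keeps rows whose value appears in the given list
selected_years = projects[projects['year'].isin([2021, 2023])]

print(recent_and_funded)
print(selected_years)
```

Rows with a missing budget never satisfy a comparison such as `> 40000`, so they drop out of the first result.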

In [None]:
# Sort by budget (highest first)
df.sort_values('budget', ascending=False)

## 4. Cleaning the Data

Real-world data is often messy. Let's check for missing values and tidy things up.

In [None]:
# Count missing values in each column
df.isnull().sum()

In [None]:
# Fill missing values in 'lead' with 'Unknown'
df['lead'] = df['lead'].fillna('Unknown')

# Drop rows where 'budget' is missing
df = df.dropna(subset=['budget'])

df

In [None]:
# Rename a column
df = df.rename(columns={'lead': 'principal_investigator'})
df.head()

In [None]:
# Remove duplicate rows (if any)
df = df.drop_duplicates()
df

## 5. Simple Aggregation

Let's summarize our data to get useful insights.

In [None]:
# Count projects by type
df.groupby('type')['project'].count()

In [None]:
# Average budget by project type
df.groupby('type')['budget'].mean()
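If you want several statistics per group at once, `agg()` accepts a list of function names. This is a minimal sketch on made-up data (the `projects` table below is hypothetical and has repeated types so that the groups contain more than one row).

```python
import pandas as pd

# Hypothetical data with more than one project per type
projects = pd.DataFrame({
    'type': ['Energy', 'Energy', 'Climate', 'Lighting'],
    'budget': [60000, 40000, 25000, 45000],
})

# One row per type, one column per statistic
summary = projects.groupby('type')['budget'].agg(['count', 'mean', 'max'])
print(summary)
```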

In [None]:
# How many projects per year?
df['year'].value_counts()

## 6. Saving Data

Once your data is clean, you can export it for use elsewhere.

In [None]:
# Save the cleaned dataframe to a new CSV file
df.to_csv('projects_cleaned.csv', index=False)

## 7. Practice Challenge (Optional)

**Task:**
- Filter all projects from 2022
- Rename the 'type' column to 'category'
- Save the result to a new CSV file called `projects_2022.csv`

## 8. Common Data Issues

### ⚠️ Encoding & Special Characters

When working with data containing special characters (like Swedish å, ä, ö), you may see strange symbols or errors if the encoding is not handled correctly.

**Tip:** Always specify the encoding when reading or writing files if you expect non-ASCII characters (anything beyond plain English letters), and use the same encoding for both steps.

In [None]:
# Example: Saving and loading Swedish characters
df_sw = pd.DataFrame({'name': ['Malmö', 'Göteborg', 'Umeå']})
df_sw.to_csv('swedish_cities.csv', index=False, encoding='utf-8')  # Always use utf-8

# Reading with correct encoding
pd.read_csv('swedish_cities.csv', encoding='utf-8')
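To see what a mismatch looks like, the sketch below writes a file in Latin-1 (as an older export tool might) and then tries to read it as UTF-8. The file name `cities_latin1.csv` is made up for this illustration. If you hit a similar error with a real file, trying `encoding='latin-1'` or `encoding='cp1252'` is a common first step, though the right choice depends on how the file was created.

```python
import pandas as pd

cities = pd.DataFrame({'name': ['Malmö', 'Göteborg', 'Umeå']})

# Simulate a file that was saved with Latin-1 instead of UTF-8
cities.to_csv('cities_latin1.csv', index=False, encoding='latin-1')

# Reading it as UTF-8 fails because the bytes for ö and å are not valid UTF-8
try:
    pd.read_csv('cities_latin1.csv', encoding='utf-8')
    utf8_worked = True
except UnicodeDecodeError as err:
    utf8_worked = False
    print('Could not read as UTF-8:', err)

# Reading with the encoding the file was actually written in works
recovered = pd.read_csv('cities_latin1.csv', encoding='latin-1')
print(recovered)
```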

### ⚠️ File Path Issues: Slashes, Spaces, and Case

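- **Windows uses backslashes (`\`)**, while Mac/Linux use forward slashes (`/`). On Windows, Python accepts both, but forward slashes are safer because a backslash inside a normal string can be read as an escape sequence (for example, `\n` becomes a newline).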
- **Spaces and mixed case** in file or folder names can cause problems. Always double-check your paths!

**Tip:** Write Windows paths as raw strings (`r'path'`) so backslashes are kept as-is, or build paths with `os.path.join()` to avoid mistakes.

In [None]:
import os
# Safer way to build file paths:
folder = 'My Data Folder'
filename = 'Results 2023.csv'
path = os.path.join(folder, filename)
print(path)  # Handles slashes for your OS
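The standard library's `pathlib` module is an alternative to `os.path.join()`, and the sketch below also shows why raw strings matter for Windows-style paths. The folder, file, and `C:\new_data` names are made up for illustration.

```python
from pathlib import Path

# Build a path with the / operator; pathlib picks the right separator for your OS
folder = Path('My Data Folder')
path = folder / 'Results 2023.csv'

print(path.name)      # the file name without the folder
print(path.suffix)    # the file extension
print(path.exists())  # check before trying to read the file

# In a normal string, \n and \t are turned into a newline and a tab
normal = 'C:\new_data\table.csv'
# A raw string keeps every backslash exactly as typed
raw = r'C:\new_data\table.csv'

print(len(normal), len(raw))
```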

### ⚠️ Inconsistent Variable Naming

Mixing up variable names (e.g., `ProjectName` vs `project_name`) can lead to bugs and confusion.

**Tip:** Stick to a naming convention (like `snake_case`) and be consistent throughout your code.

In [None]:
# Example: Inconsistent naming can cause errors
projectName = 'Green Roof'
# print(project_name)  # Uncommenting this raises a NameError: project_name is not defined yet

# Consistent naming:
project_name = 'Green Roof'

### ⚠️ Handling Newlines in Text Data

Sometimes, text fields contain `\n` (newline) characters. This can make your data look odd when displayed, and it can confuse tools that read CSV files one line at a time.

**Tip:** Use `str.replace()` or pandas string methods to clean up newlines if needed.

In [None]:
# Example: Cleaning newlines in a text column
df_text = pd.DataFrame({'notes': ['Line one\nLine two', 'No newline here']})
df_text['notes_clean'] = df_text['notes'].str.replace('\n', ' ', regex=False)
df_text
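Text exported from Windows tools or web forms may also contain `\r\n`, tabs, or repeated blank lines. One possible approach, sketched below on made-up notes, is to collapse every run of whitespace into a single space with a regular expression.

```python
import pandas as pd

# Hypothetical notes with Windows line endings, tabs, and blank lines
notes = pd.DataFrame({
    'notes': ['Line one\r\nLine two', 'Tabs\tand\n\nblank lines', 'No newline here']
})

# \s+ matches any run of whitespace (spaces, tabs, \r, \n)
notes['notes_clean'] = (
    notes['notes']
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
print(notes['notes_clean'].tolist())
```

This also merges intentional double spaces, so it suits free-text notes better than fields where exact spacing matters.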

### Other Common Issues to Watch For

- **Date formats:** Dates may be in different formats (e.g., `YYYY-MM-DD` vs `DD/MM/YYYY`). Use `pd.to_datetime()` to standardize.
- **Missing values:** Not all missing values are `NaN`—sometimes they're empty strings or special codes.
- **Data types:** Numbers may be read as text if there are stray characters.

Always inspect your data and use pandas tools to clean and standardize!
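The three issues above can be sketched on one small, made-up file. The column names, the `missing` and `-999` codes, and the `45 000 kr` value are all assumptions for illustration; your own data will need its own list of codes and formats.

```python
import io
import pandas as pd

# A made-up CSV with day-first dates, special missing codes, and a stray unit
raw_csv = io.StringIO(
    "project,start,budget\n"
    "Green Roof,15/03/2021,50000\n"
    "Solar Study,01/09/2022,missing\n"
    "Urban Heat,20/01/2022,-999\n"
    "Daylight,05/05/2023,45 000 kr\n"
)

# Missing values: tell pandas which special codes mean "no data"
messy = pd.read_csv(raw_csv, na_values=['missing', '-999'])

# Date formats: state the format explicitly so 01/09 is read as 1 September
messy['start'] = pd.to_datetime(messy['start'], format='%d/%m/%Y')

# Data types: remove everything except digits and dots, then convert to numbers.
# This assumes budgets are never negative, because minus signs are removed too.
budget_text = messy['budget'].str.replace(r'[^0-9.]', '', regex=True)
messy['budget'] = pd.to_numeric(budget_text, errors='coerce')

print(messy)
messy.info()
```

With `errors='coerce'`, anything that still cannot be converted becomes `NaN` instead of stopping the notebook, so it is worth counting missing values afterwards.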

## 9. Links & Next Steps

- [pandas documentation](https://pandas.pydata.org/docs/)
- Try the next notebook: Automating Repetitive Tasks
- Keep experimenting—every dataset is a new opportunity to learn!