# Comprehensive Exploratory Data Analysis (EDA) Guide

A step-by-step guide to essential EDA concepts, progressing from absolute beginner to advanced techniques. This notebook covers the beginner level.

## Table of Contents - Beginner Level

- Environment Setup
- Data Import and Export
- Initial Data Inspection
- Basic Data Selection
- Column Operations
- Simple Statistics
- Basic Filtering
- Basic Grouping
- Basic Sorting

## Environment Setup

### Essential Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Basic Configuration

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

plt.rcParams['figure.figsize'] = (10, 6)
sns.set_style("whitegrid")
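
The `pd.set_option` calls above change display settings for the rest of the session. As a small sketch of an alternative, `pd.option_context` applies a setting only inside a `with` block. The `demo` frame and the value `6` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate display options
demo = pd.DataFrame({'value': range(100)})

rows_before = pd.get_option('display.max_rows')

# Inside the block at most 6 rows are printed; the setting reverts on exit
with pd.option_context('display.max_rows', 6):
    rows_inside = pd.get_option('display.max_rows')
    print(demo)

rows_after = pd.get_option('display.max_rows')
print("max_rows before/inside/after:", rows_before, rows_inside, rows_after)
```

This can be handy when one wide or long table needs different limits without disturbing the rest of the notebook.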

## Data Import and Export

### Reading CSV Files

In [4]:
# Basic CSV reading
df = pd.read_csv('data.csv')

# With different separators
df = pd.read_csv('data.csv', sep=';')
df = pd.read_csv('data.csv', sep='\t')

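# Handling headers
df = pd.read_csv('data.csv', header=0)      # first line holds the column names (default)
df = pd.read_csv('data.csv', header=None)   # no header line; columns are numbered 0, 1, 2, ...
df = pd.read_csv('data.csv', names=['col1', 'col2', 'col3'])  # supply names; add header=0 to replace an existing header line

# Skipping rows
df = pd.read_csv('data.csv', skiprows=1)       # skip the first line of the file
df = pd.read_csv('data.csv', skiprows=[0, 2])  # skip the lines at positions 0 and 2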

# Reading specific columns
df = pd.read_csv('data.csv', usecols=['name', 'age', 'salary'])
df = pd.read_csv('data.csv', usecols=[0, 1, 3])

# Limiting rows
df = pd.read_csv('data.csv', nrows=1000)
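
The options above can be combined in a single call. The following sketch assumes no `data.csv` exists on disk, so it reads from an in-memory string instead; the text in `raw` and the `city` column are invented for the example.

```python
import io
import pandas as pd

# Made-up semicolon-separated text standing in for a file on disk
raw = (
    "name;age;salary;city\n"
    "John;25;50000;Oslo\n"
    "Jane;30;60000;Rome\n"
    "Bob;35;70000;Lima\n"
)

# Custom separator, a subset of columns, and a row limit in one call
people = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    usecols=['name', 'salary'],
    nrows=2,
)
print(people)
```

Replacing `io.StringIO(raw)` with a file path gives the same behaviour for a real file.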

### Reading Excel Files

In [None]:
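# First sheet by default (reading .xlsx requires an engine such as openpyxl)
df = pd.read_excel('data.xlsx')

# Select a sheet by name or by zero-based position
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name=0)

# sheet_name=None loads every sheet into a dict of DataFrames keyed by sheet name
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)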

### Creating Sample Data for Examples

Let's create some sample data to work with throughout this guide:

In [6]:
# Create sample dataset for demonstrations
np.random.seed(42)

data = {
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [25, 30, 35, 28, 42, 33, 29, 38, 26, 45],
    'salary': [50000, 60000, 70000, 55000, 85000, 62000, 58000, 75000, 52000, 90000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'IT', 'HR', 'Marketing', 'IT', 'Finance', 'IT'],
    'years_experience': [3, 8, 12, 5, 18, 10, 6, 15, 4, 20]
}

df = pd.DataFrame(data)
print("Sample dataset created:")
df

Sample dataset created:


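|   | name | age | salary | department | years_experience |
|---|---|---|---|---|---|
| 0 | John | 25 | 50000 | IT | 3 |
| 1 | Jane | 30 | 60000 | HR | 8 |
| 2 | Bob | 35 | 70000 | IT | 12 |
| 3 | Alice | 28 | 55000 | Finance | 5 |
| 4 | Charlie | 42 | 85000 | IT | 18 |
| 5 | Diana | 33 | 62000 | HR | 10 |
| 6 | Eve | 29 | 58000 | Marketing | 6 |
| 7 | Frank | 38 | 75000 | IT | 15 |
| 8 | Grace | 26 | 52000 | Finance | 4 |
| 9 | Henry | 45 | 90000 | IT | 20 |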


## Initial Data Inspection

### Basic Dataset Information

In [7]:
# Dataset dimensions
print("Rows:", len(df))
print("Columns:", len(df.columns))
print("Shape:", df.shape)

# Column names
print("Column names:", df.columns.tolist())

# Index information
print("Index:", df.index)
print("Index name:", df.index.name)

Rows: 10
Columns: 5
Shape: (10, 5)
Column names: ['name', 'age', 'salary', 'department', 'years_experience']
Index: RangeIndex(start=0, stop=10, step=1)
Index name: None


### First Look at Data

In [8]:
# First few rows
print("First 5 rows:")
df.head()

First 5 rows:


|   | name | age | salary | department | years_experience |
|---|---|---|---|---|---|
| 0 | John | 25 | 50000 | IT | 3 |
| 1 | Jane | 30 | 60000 | HR | 8 |
| 2 | Bob | 35 | 70000 | IT | 12 |
| 3 | Alice | 28 | 55000 | Finance | 5 |
| 4 | Charlie | 42 | 85000 | IT | 18 |


In [9]:
# Last few rows
print("Last 3 rows:")
df.tail(3)

Last 3 rows:


|   | name | age | salary | department | years_experience |
|---|---|---|---|---|---|
| 7 | Frank | 38 | 75000 | IT | 15 |
| 8 | Grace | 26 | 52000 | Finance | 4 |
| 9 | Henry | 45 | 90000 | IT | 20 |


In [10]:
# Random sample
print("Random sample of 3 rows:")
df.sample(3)

Random sample of 3 rows:


|   | name | age | salary | department | years_experience |
|---|---|---|---|---|---|
| 8 | Grace | 26 | 52000 | Finance | 4 |
| 1 | Jane | 30 | 60000 | HR | 8 |
| 5 | Diana | 33 | 62000 | HR | 10 |
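
Which rows `df.sample` returns depends on the random state, so the rows shown above may differ from run to run. A minimal sketch of making the draw repeatable with the `random_state` parameter follows; the seed value `42` is an arbitrary choice, and only two columns of the sample data are rebuilt to keep the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie',
             'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [25, 30, 35, 28, 42, 33, 29, 38, 26, 45],
})

# The same random_state gives the same rows on every call
sample_a = df.sample(3, random_state=42)
sample_b = df.sample(3, random_state=42)
print(sample_a)

# frac draws a proportion of the rows instead of a fixed count
half = df.sample(frac=0.5, random_state=42)
print(len(half), "rows in the 50% sample")
```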


### Data Overview

In [11]:
# General information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              10 non-null     object
 1   age               10 non-null     int64 
 2   salary            10 non-null     int64 
 3   department        10 non-null     object
 4   years_experience  10 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 532.0+ bytes


In [12]:
# Data types
print("Data types:")
df.dtypes

Data types:


name                object
age                  int64
salary               int64
department          object
years_experience     int64
dtype: object

## Basic Data Selection

### Selecting Columns

In [13]:
# Single column
name_column = df['name']
print("Name column:")
print(name_column.head())

# Multiple columns
subset = df[['name', 'age', 'salary']]
print("\nSubset with name, age, salary:")
subset.head()

Name column:
0       John
1       Jane
2        Bob
3      Alice
4    Charlie
Name: name, dtype: object

Subset with name, age, salary:


|   | name | age | salary |
|---|---|---|---|
| 0 | John | 25 | 50000 |
| 1 | Jane | 30 | 60000 |
| 2 | Bob | 35 | 70000 |
| 3 | Alice | 28 | 55000 |
| 4 | Charlie | 42 | 85000 |


In [14]:
# Select by position
first_three_cols = df.iloc[:, :3]
print("First 3 columns:")
first_three_cols.head()

First 3 columns:


|   | name | age | salary |
|---|---|---|---|
| 0 | John | 25 | 50000 |
| 1 | Jane | 30 | 60000 |
| 2 | Bob | 35 | 70000 |
| 3 | Alice | 28 | 55000 |
| 4 | Charlie | 42 | 85000 |


### Selecting Rows

In [15]:
# By position
first_row = df.iloc[0]
print("First row:")
print(first_row)

# First 5 rows
first_five = df.iloc[:5]
print("\nFirst 5 rows:")
first_five

First row:
name                 John
age                    25
salary              50000
department             IT
years_experience        3
Name: 0, dtype: object

First 5 rows:


|   | name | age | salary | department | years_experience |
|---|---|---|---|---|---|
| 0 | John | 25 | 50000 | IT | 3 |
| 1 | Jane | 30 | 60000 | HR | 8 |
| 2 | Bob | 35 | 70000 | IT | 12 |
| 3 | Alice | 28 | 55000 | Finance | 5 |
| 4 | Charlie | 42 | 85000 | IT | 18 |
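
The examples above select by position with `iloc`. As a short sketch of the label-based counterpart, `loc` selects by index label and column name. One difference worth noting is that a `loc` slice includes its end label, while an `iloc` slice excludes its end position. The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'age': [25, 30, 35, 28, 42],
    'salary': [50000, 60000, 70000, 55000, 85000],
})

# Label-based: rows labelled 0 through 2 (end label included)
by_label = df.loc[0:2, ['name', 'age']]

# Position-based: rows at positions 0 and 1 (end position excluded)
by_position = df.iloc[0:2, 0:2]

# A single cell by row label and column name
one_name = df.loc[4, 'name']

print(by_label)
print(by_position)
print(one_name)
```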


## Column Operations

### Adding New Columns

In [16]:
# Add a column with the same value for everyone
df['status'] = 'active'
print("After adding status column:")
df.head()

After adding status column:


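|   | name | age | salary | department | years_experience | status |
|---|---|---|---|---|---|---|
| 0 | John | 25 | 50000 | IT | 3 | active |
| 1 | Jane | 30 | 60000 | HR | 8 | active |
| 2 | Bob | 35 | 70000 | IT | 12 | active |
| 3 | Alice | 28 | 55000 | Finance | 5 | active |
| 4 | Charlie | 42 | 85000 | IT | 18 | active |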


In [17]:
# Create a column by combining other columns
df['full_info'] = df['name'] + ' is ' + df['age'].astype(str) + ' years old'
print("After adding full_info column:")
df[['name', 'age', 'full_info']].head()

After adding full_info column:


|   | name | age | full_info |
|---|---|---|---|
| 0 | John | 25 | John is 25 years old |
| 1 | Jane | 30 | Jane is 30 years old |
| 2 | Bob | 35 | Bob is 35 years old |
| 3 | Alice | 28 | Alice is 28 years old |
| 4 | Charlie | 42 | Charlie is 42 years old |


In [18]:
# Add a column with conditions (if-else logic)
df['age_group'] = np.where(df['age'] >= 30, 'Senior', 'Junior')
print("After adding age_group column:")
df[['name', 'age', 'age_group']].head()

After adding age_group column:


|   | name | age | age_group |
|---|---|---|---|
| 0 | John | 25 | Junior |
| 1 | Jane | 30 | Senior |
| 2 | Bob | 35 | Senior |
| 3 | Alice | 28 | Junior |
| 4 | Charlie | 42 | Senior |
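
`np.where` handles a two-way split. When more than two groups are needed, one possible approach is `pd.cut`, sketched below. The bin edges, the labels, and the `age_band` column name are assumptions chosen for illustration, not part of the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie',
             'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [25, 30, 35, 28, 42, 33, 29, 38, 26, 45],
})

# Hypothetical cut-offs; each bin includes its right edge by default,
# so the bins are (0, 29], (29, 39], (39, 120]
bins = [0, 29, 39, 120]
labels = ['Under 30', '30-39', '40+']

df['age_band'] = pd.cut(df['age'], bins=bins, labels=labels)
print(df)
print(df['age_band'].value_counts())
```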


## Simple Statistics

### Basic Descriptive Statistics

In [19]:
# Basic statistics for numeric columns
df.describe()

|   | age | salary | years_experience |
|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 |
| mean | 33.1 | 65700.0 | 10.1 |
| std | 6.806043 | 13832.730911 | 5.989806 |
| min | 25.0 | 50000.0 | 3.0 |
| 25% | 28.25 | 55750.0 | 5.25 |
| 50% | 31.5 | 61000.0 | 9.0 |
| 75% | 37.25 | 73750.0 | 14.25 |
| max | 45.0 | 90000.0 | 20.0 |


In [20]:
# Individual statistics for age column
print("Age statistics:")
print(f"Mean: {df['age'].mean():.2f}")
print(f"Median: {df['age'].median():.2f}")
print(f"Min: {df['age'].min()}")
print(f"Max: {df['age'].max()}")
print(f"Standard deviation: {df['age'].std():.2f}")
print(f"Range: {df['age'].max() - df['age'].min()}")

Age statistics:
Mean: 33.10
Median: 31.50
Min: 25
Max: 45
Standard deviation: 6.81
Range: 20
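
The standard deviation reported above is the sample standard deviation: pandas divides by `n - 1` (`ddof=1`) by default, whereas NumPy's `np.std` divides by `n` (`ddof=0`) by default. The sketch below contrasts the two on the age values from the sample data.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, 35, 28, 42, 33, 29, 38, 26, 45], name='age')

sample_std = age.std()               # pandas default, ddof=1
population_std = age.std(ddof=0)     # divide by n instead of n - 1
numpy_std = np.std(age.to_numpy())   # NumPy default, ddof=0

print(f"Sample std (ddof=1):     {sample_std:.2f}")
print(f"Population std (ddof=0): {population_std:.2f}")
print(f"NumPy default:           {numpy_std:.2f}")
```

The difference matters most for small datasets like this one and shrinks as the number of rows grows.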


### Value Counting and Frequencies

In [21]:
# Count how many times each value appears
print("Department counts:")
counts = df['department'].value_counts()
print(counts)

Department counts:
department
IT           5
HR           2
Finance      2
Marketing    1
Name: count, dtype: int64


In [22]:
# Show as percentages instead of counts
print("Department percentages:")
percentages = df['department'].value_counts(normalize=True) * 100
print(percentages)

Department percentages:
department
IT           50.0
HR           20.0
Finance      20.0
Marketing    10.0
Name: proportion, dtype: float64
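
Counts and percentages are often easier to read side by side. A minimal sketch of combining them into one frequency table follows; the `freq_table` name and its column labels are illustrative.

```python
import pandas as pd

department = pd.Series(
    ['IT', 'HR', 'IT', 'Finance', 'IT', 'HR', 'Marketing', 'IT', 'Finance', 'IT'],
    name='department',
)

counts = department.value_counts()

# Both columns share the department index, so they align row by row
freq_table = pd.DataFrame({
    'count': counts,
    'percent': (counts / counts.sum() * 100).round(1),
})
print(freq_table)
```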


## Basic Filtering

### Simple Conditions

In [23]:
# Filter by age (older than 30)
adults_over_30 = df[df['age'] > 30]
print("People older than 30:")
adults_over_30

People older than 30:


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 2 | Bob | 35 | 70000 | IT | 12 | active | Bob is 35 years old | Senior |
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 5 | Diana | 33 | 62000 | HR | 10 | active | Diana is 33 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |


In [24]:
# Filter by salary (high earners)
high_earners = df[df['salary'] > 60000]
print("High earners (salary > 60000):")
high_earners

High earners (salary > 60000):


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 2 | Bob | 35 | 70000 | IT | 12 | active | Bob is 35 years old | Senior |
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 5 | Diana | 33 | 62000 | HR | 10 | active | Diana is 33 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |


In [25]:
# Filter by department (specific department)
it_department = df[df['department'] == 'IT']
print("IT Department employees:")
it_department

IT Department employees:


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 0 | John | 25 | 50000 | IT | 3 | active | John is 25 years old | Junior |
| 2 | Bob | 35 | 70000 | IT | 12 | active | Bob is 35 years old | Senior |
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |


### Multiple Conditions

In [26]:
# AND conditions (young and high earners)
young_high_earners = df[(df['age'] < 30) & (df['salary'] > 55000)]
print("Young high earners (age < 30 AND salary > 55000):")
young_high_earners

Young high earners (age < 30 AND salary > 55000):


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 6 | Eve | 29 | 58000 | Marketing | 6 | active | Eve is 29 years old | Junior |


In [27]:
# OR conditions (senior or high earners)
# Note: "senior" here means age > 40, unlike the age_group column above (age >= 30)
senior_or_high_earners = df[(df['age'] > 40) | (df['salary'] > 70000)]
print("Senior or high earners (age > 40 OR salary > 70000):")
senior_or_high_earners

Senior or high earners (age > 40 OR salary > 70000):


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |
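
Boolean masks with `&` and `|` are not the only way to combine conditions. As a hedged sketch, `DataFrame.query`, `Series.isin`, and `Series.between` can express the same kind of filter more compactly. The HR/Finance filter and the 30 to 40 age range below are made-up examples.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie',
             'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [25, 30, 35, 28, 42, 33, 29, 38, 26, 45],
    'salary': [50000, 60000, 70000, 55000, 85000,
               62000, 58000, 75000, 52000, 90000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'IT',
                   'HR', 'Marketing', 'IT', 'Finance', 'IT'],
})

# Same AND filter written two ways
young_high_mask = df[(df['age'] < 30) & (df['salary'] > 55000)]
young_high_query = df.query('age < 30 and salary > 55000')

# Membership in a list of values, instead of chaining == with |
hr_or_finance = df[df['department'].isin(['HR', 'Finance'])]

# Inclusive range check, instead of (age >= 30) & (age <= 40)
aged_30_to_40 = df[df['age'].between(30, 40)]

print(young_high_query)
print(hr_or_finance)
print(aged_30_to_40)
```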


## Basic Grouping

### Simple Grouping

In [28]:
# Group by department
dept_groups = df.groupby('department')

# Group sizes
print("Group sizes (number of employees per department):")
print(dept_groups.size())

Group sizes (number of employees per department):
department
Finance      2
HR           2
IT           5
Marketing    1
dtype: int64


In [29]:
# Average salary by department
print("Average salary by department:")
print(dept_groups['salary'].mean())

Average salary by department:
department
Finance      53500.0
HR           61000.0
IT           74000.0
Marketing    58000.0
Name: salary, dtype: float64


### Basic Aggregations

In [30]:
# Multiple aggregations for salary by department
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'sum', 'count', 'min', 'max']
})
print("Department salary statistics:")
dept_stats

Department salary statistics:


| department | salary mean | salary sum | salary count | salary min | salary max |
|---|---|---|---|---|---|
| Finance | 53500.0 | 107000 | 2 | 52000 | 55000 |
| HR | 61000.0 | 122000 | 2 | 60000 | 62000 |
| IT | 74000.0 | 370000 | 5 | 50000 | 90000 |
| Marketing | 58000.0 | 58000 | 1 | 58000 | 58000 |


In [31]:
# Multiple columns aggregation
summary = df.groupby('department').agg({
    'salary': 'mean',
    'age': 'mean',
    'years_experience': 'mean'
}).round(2)
print("Summary by department:")
summary

Summary by department:


| department | salary | age | years_experience |
|---|---|---|---|
| Finance | 53500.0 | 27.0 | 4.5 |
| HR | 61000.0 | 31.5 | 9.0 |
| IT | 74000.0 | 37.0 | 13.6 |
| Marketing | 58000.0 | 29.0 | 6.0 |
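
Passing a list of functions to `agg`, as in the salary statistics above, produces two-level column headers. One way to get flat, descriptive column names instead is named aggregation, sketched here. The output names `mean_salary`, `max_salary`, and `headcount` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie',
             'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'salary': [50000, 60000, 70000, 55000, 85000,
               62000, 58000, 75000, 52000, 90000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'IT',
                   'HR', 'Marketing', 'IT', 'Finance', 'IT'],
})

# Each keyword is an output column: new_name=(source_column, function)
dept_summary = df.groupby('department').agg(
    mean_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    headcount=('name', 'count'),
)
print(dept_summary)
```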


## Basic Sorting

### Single and Multiple Column Sorting

In [32]:
# Sort by age (ascending)
df_sorted_age = df.sort_values('age')
print("Sorted by age (ascending):")
df_sorted_age

Sorted by age (ascending):


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 0 | John | 25 | 50000 | IT | 3 | active | John is 25 years old | Junior |
| 8 | Grace | 26 | 52000 | Finance | 4 | active | Grace is 26 years old | Junior |
| 3 | Alice | 28 | 55000 | Finance | 5 | active | Alice is 28 years old | Junior |
| 6 | Eve | 29 | 58000 | Marketing | 6 | active | Eve is 29 years old | Junior |
| 1 | Jane | 30 | 60000 | HR | 8 | active | Jane is 30 years old | Senior |
| 5 | Diana | 33 | 62000 | HR | 10 | active | Diana is 33 years old | Senior |
| 2 | Bob | 35 | 70000 | IT | 12 | active | Bob is 35 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |


In [33]:
# Sort by salary (descending)
df_sorted_salary = df.sort_values('salary', ascending=False)
print("Sorted by salary (descending):")
df_sorted_salary

Sorted by salary (descending):


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 2 | Bob | 35 | 70000 | IT | 12 | active | Bob is 35 years old | Senior |
| 5 | Diana | 33 | 62000 | HR | 10 | active | Diana is 33 years old | Senior |
| 1 | Jane | 30 | 60000 | HR | 8 | active | Jane is 30 years old | Senior |
| 6 | Eve | 29 | 58000 | Marketing | 6 | active | Eve is 29 years old | Junior |
| 3 | Alice | 28 | 55000 | Finance | 5 | active | Alice is 28 years old | Junior |
| 8 | Grace | 26 | 52000 | Finance | 4 | active | Grace is 26 years old | Junior |
| 0 | John | 25 | 50000 | IT | 3 | active | John is 25 years old | Junior |


In [34]:
# Sort by multiple columns (department first, then salary)
df_sorted_multi = df.sort_values(['department', 'salary'], ascending=[True, False])
print("Sorted by department (ascending) and salary (descending):")
df_sorted_multi

Sorted by department (ascending) and salary (descending):


|   | name | age | salary | department | years_experience | status | full_info | age_group |
|---|---|---|---|---|---|---|---|---|
| 3 | Alice | 28 | 55000 | Finance | 5 | active | Alice is 28 years old | Junior |
| 8 | Grace | 26 | 52000 | Finance | 4 | active | Grace is 26 years old | Junior |
| 5 | Diana | 33 | 62000 | HR | 10 | active | Diana is 33 years old | Senior |
| 1 | Jane | 30 | 60000 | HR | 8 | active | Jane is 30 years old | Senior |
| 9 | Henry | 45 | 90000 | IT | 20 | active | Henry is 45 years old | Senior |
| 4 | Charlie | 42 | 85000 | IT | 18 | active | Charlie is 42 years old | Senior |
| 7 | Frank | 38 | 75000 | IT | 15 | active | Frank is 38 years old | Senior |
| 2 | Bob | 35 | 70000 | IT | 12 | active | Bob is 35 years old | Senior |
| 0 | John | 25 | 50000 | IT | 3 | active | John is 25 years old | Junior |
| 6 | Eve | 29 | 58000 | Marketing | 6 | active | Eve is 29 years old | Junior |


### Top/Bottom Values

In [35]:
# Top 3 earners
top_earners = df.nlargest(3, 'salary')
print("Top 3 earners:")
top_earners[['name', 'salary', 'department']]

Top 3 earners:


|   | name | salary | department |
|---|---|---|---|
| 9 | Henry | 90000 | IT |
| 4 | Charlie | 85000 | IT |
| 7 | Frank | 75000 | IT |


In [36]:
# Bottom 3 earners
bottom_earners = df.nsmallest(3, 'salary')
print("Bottom 3 earners:")
bottom_earners[['name', 'salary', 'department']]

Bottom 3 earners:


|   | name | salary | department |
|---|---|---|---|
| 0 | John | 50000 | IT |
| 8 | Grace | 52000 | Finance |
| 3 | Alice | 55000 | Finance |
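
`nlargest` and `nsmallest` give the same rows as sorting and then taking the head, at least when the sort column has no ties, as is the case for salary here. The sketch below checks that equivalence and then combines sorting with grouping to find the top earner in each department. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie',
             'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'salary': [50000, 60000, 70000, 55000, 85000,
               62000, 58000, 75000, 52000, 90000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'IT',
                   'HR', 'Marketing', 'IT', 'Finance', 'IT'],
})

# Two routes to the top 3 earners
top_by_sort = df.sort_values('salary', ascending=False).head(3)
top_by_nlargest = df.nlargest(3, 'salary')

# Highest earner per department: sort first, then keep the first row of each group
top_per_dept = (
    df.sort_values('salary', ascending=False)
      .groupby('department')
      .head(1)
)

print(top_by_nlargest)
print(top_per_dept)
```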


## Conclusion

This notebook has covered the essential EDA techniques including:

- **Environment Setup**: Importing libraries and configuring display settings
- **Data Import**: Reading CSV and Excel files
- **Data Inspection**: Understanding data structure, types, and basic information
- **Data Selection**: Accessing specific rows, columns, and subsets
- **Column Operations**: Adding constant, derived, and conditional columns
- **Statistics**: Descriptive statistics and value counting
- **Filtering**: Single-condition and combined (AND/OR) filtering
- **Grouping**: Aggregating data by categories
- **Sorting**: Ordering rows by one or more columns and finding top/bottom values

These foundational skills form the basis for more advanced exploratory data analysis techniques. Practice with different datasets to master these concepts!