In [1]:
import pandas as pd

print(pd.__version__)

2.3.3


# Core Data Structures in Pandas

Pandas is built on two main data structures:
- Series -> One dimensional (like a single column in Excel)
- DataFrame -> Two dimensional (like a full spreadsheet or SQL table)

In [2]:
s1 = pd.Series([71,75,82,78,88,92])
s1

0    71
1    75
2    82
3    78
4    88
5    92
dtype: int64

Notice the automatic index: 0, 1, 2, 3, 4, 5

In [3]:
type(s1)

pandas.core.series.Series

In [4]:
s2 = pd.Series([71,75,82,78,88,92], index=["Kabir", "Haris", "Kamala", "Victor", "Maryam", "Lisa"])
s2

Kabir     71
Haris     75
Kamala    82
Victor    78
Maryam    88
Lisa      92
dtype: int64

In the above example, we explicitly define the index values.

In [5]:
print("Kabir marks:",s2["Kabir"])
print("Maryam marks:",s2["Maryam"])


Kabir marks: 71
Maryam marks: 88


Now I can easily look up marks by using the index label as a reference.

A `pandas.Series` may look similar to a Python dictionary because both store data with labels, but a Series offers much more. Unlike a dictionary, a Series supports fast vectorized operations, automatic index alignment during arithmetic, and handles missing data using `NaN`. It also allows both label-based and position-based access, and integrates seamlessly with the pandas ecosystem, especially DataFrames. While a dictionary is great for simple key–value storage, a Series is better suited for data analysis and manipulation tasks where performance, flexibility, and built-in functionality matter.
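The differences from a dictionary can be seen in a minimal sketch. The `bonus` Series and its values are made up for illustration; only the first three marks come from the example above.

```python
import pandas as pd

marks = pd.Series({"Kabir": 71, "Haris": 75, "Kamala": 82})
bonus = pd.Series({"Kabir": 5, "Kamala": 3, "Lisa": 4})  # hypothetical values

# Vectorized arithmetic: one expression applies to every element
scaled = marks * 2

# Index alignment: values are matched by label, and labels present
# in only one Series ("Haris", "Lisa") become NaN in the result
total = marks + bonus

# Label-based and position-based access to the same element
by_label = marks.loc["Haris"]
by_position = marks.iloc[1]

print(total)
```

A plain dictionary would need an explicit loop for each of these steps, and adding two dictionaries with `+` is not supported at all.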

### Pandas DataFrames

In [6]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25,30,35],
    "City": ["New York", "Paris", "Moscow"],
    "Program": ["BS-Ai", "BS-Acc&Fin", "BS-Comp.Sci"]
}

In [7]:
data

{'Name': ['Alice', 'Bob', 'Charlie'],
 'Age': [25, 30, 35],
 'City': ['New York', 'Paris', 'Moscow'],
 'Program': ['BS-Ai', 'BS-Acc&Fin', 'BS-Comp.Sci']}

In [8]:
df = pd.DataFrame(data)
df

,Name,Age,City,Program
0,Alice,25,New York,BS-Ai
1,Bob,30,Paris,BS-Acc&Fin
2,Charlie,35,Moscow,BS-Comp.Sci


In [9]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [10]:
df.columns

Index(['Name', 'Age', 'City', 'Program'], dtype='object')

### Pandas operations built on these foundations:
- Selection
- Filtering
- Merging
- Aggregation

Series = 1D array with labels | 
DataFrame = 2D table with rows + columns
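The four operations listed above can be sketched on two tiny tables. The `schools` table and its values are assumptions made for illustration, not data from this notebook.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Kabir", "Kainat", "Jack", "Jane"],
    "Marks": [78, 82, 75, 88],
})
# Hypothetical lookup table used only to demonstrate merging
schools = pd.DataFrame({
    "Name": ["Kabir", "Kainat", "Jack", "Jane"],
    "School": ["SPS", "SPS", "WPS", "WPS"],
})

# Selection: pick a single column, which comes back as a Series
marks = students["Marks"]

# Filtering: keep only the rows where a boolean condition holds
high = students[students["Marks"] >= 80]

# Merging: combine two tables on a shared key column
merged = students.merge(schools, on="Name")

# Aggregation: summarize each group of rows with one number
mean_by_school = merged.groupby("School")["Marks"].mean()

print(high)
print(mean_by_school)
```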

In [12]:
data = [['Kabir', 78], ['Kainat', 82], ['Jack', 75], ['Jane', 88]]
data

[['Kabir', 78], ['Kainat', 82], ['Jack', 75], ['Jane', 88]]

In [16]:
pd.DataFrame(data, columns=['Name', 'Marks'])  # Explicitly define the column names

,Name,Marks
0,Kabir,78
1,Kainat,82
2,Jack,75
3,Jane,88


In [17]:
import numpy as np

In [31]:
data = [[1, 2], [5, 6]]
data

[[1, 2], [5, 6]]

In [32]:
arr = np.array(data)
arr

array([[1, 2],
       [5, 6]])

In [34]:
df = pd.DataFrame(arr, columns=['Val', 'Num'])
df

,Val,Num
0,1,2
1,5,6


## Reading an Excel file using Pandas

In [39]:
df = pd.read_excel('data.xlsx')  # You can also read a CSV file using pd.read_csv(file_name)
df

,Name,School,Marks
0,Kabir,SPS,78
1,Joshua,UPS,68
2,Jack,WPS,45
3,Alex,RTS,92
4,Carey,RRY,88
5,Linda,TCS,62
6,Yaris,CSS,79
7,Kainat,SPS,81
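`pd.read_csv` works the same way as `pd.read_excel` and also accepts file-like objects, so it can be tried without a file on disk. The sketch below feeds it an in-memory string holding the first three rows shown above; the names `csv_text`, `df_csv`, and `df_subset` are introduced here for illustration. Note that reading `.xlsx` files typically requires an extra engine such as `openpyxl` to be installed.

```python
import io
import pandas as pd

# A small CSV held in memory instead of a file on disk
csv_text = """Name,School,Marks
Kabir,SPS,78
Joshua,UPS,68
Jack,WPS,45
"""

# read_csv infers the header row and the column dtypes
df_csv = pd.read_csv(io.StringIO(csv_text))

# usecols limits the result to the listed columns
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Marks"])

print(df_csv)
print(df_subset.columns.tolist())
```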


## Loading a sample dataset from a URL

In [40]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
df

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


# Performing Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the investigative phase of data science. It builds an understanding of a dataset through statistical summaries, visualizations, and pattern detection, and it supports reliable downstream modeling by surfacing data-quality issues and insights iteratively.

## Example
As a finance example, consider analyzing a stock portfolio dataset. Histograms might reveal that returns are right-skewed (rare large gains). Scatter plots might show a negative correlation between bond and equity returns (a diversification benefit). Heatmaps might expose multicollinearity among stocks in the same sector. Together, these findings guide risk assessment and portfolio optimization strategies such as mean-variance analysis.
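The numbers behind such plots can be computed directly. The sketch below uses synthetic returns generated with NumPy (not real market data), built so that equity returns are right-skewed and bond returns move against them; the column names `equity` and `bond` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 1000

# Synthetic data: right-skewed equity returns by construction
equity = rng.lognormal(mean=0.0, sigma=0.5, size=n) - 1.0
# Synthetic data: bond returns move opposite to equity, plus noise
bond = -0.5 * equity + rng.normal(0.0, 0.1, size=n)

returns = pd.DataFrame({"equity": equity, "bond": bond})

# Positive skewness means a long right tail, as a histogram would show
skewness = returns["equity"].skew()

# The correlation matrix is what a heatmap would visualize
corr = returns.corr()

print(f"Equity skewness: {skewness:.2f}")
print(corr)
```

On real data, the same two calls (`skew()` and `corr()`) give a quick numeric check before any plotting.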

Below are some key methods for starting the EDA stage:

In [42]:
df.head() #First five rows

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [43]:
df.tail() #Last five rows

,total_bill,tip,sex,smoker,day,time,size
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.0,Female,Yes,Sat,Dinner,2
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2


In [45]:
df.info()  # Column dtypes, non-null counts, and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [46]:
df.describe()  # Summary statistics for the numeric columns

,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


In [48]:
df.columns  # List the column names in the data

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [55]:
print(f"Shape: {df.shape} means we have {df.shape[0]} rows and {df.shape[1]} columns")

Shape: (244, 7) means we have 244 rows and 7 columns


## Summary
Pandas methods and attributes such as `df.head()`, `df.tail()`, `df.info()`, `df.describe()`, `df.columns`, and `df.shape` are essential for initial EDA, offering quick glimpses into dataset structure, content, and statistics. They help identify data types, missing values, distributions, and potential issues early on.

- df.head(n): Shows the first n rows (default 5) for a top-level preview. Example: df.head(3) displays initial entries like customer IDs and ages to spot patterns or errors right away.
- df.tail(n): Displays the last n rows (default 5) to check the end of the data. Example: df.tail(2) reveals recent transactions, useful for time-series data to ensure completeness.
- df.info(): Provides a concise summary of data types, non-null counts, and memory usage. Example: Reveals 'Age' as float64 with 90 non-null values out of 100 rows (10 missing), highlighting imputation needs. A numeric column with missing values defaults to float64 because `NaN` is a float.
- df.describe(): Generates statistical summaries (mean, std, min, max, quartiles) for numerical columns. Example: For 'Income', mean=45k, std=15k, min=20k, max=100k, indicating spread and potential outliers.
- df.columns: Lists all column names as an Index object. Example: Returns Index(['CustomerID', 'Age', 'Income', 'Purchase']), aiding in feature selection or renaming.
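The methods above can be tried on a small table. This is a minimal sketch; the `customers` DataFrame and all of its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with two missing ages
customers = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105, 106],
    "Age": [34, np.nan, 45, 29, np.nan, 52],
    "Income": [45000, 38000, 100000, 20000, 52000, 61000],
})

first_three = customers.head(3)   # first 3 rows
last_two = customers.tail(2)      # last 2 rows

# Missing values per column, a numeric complement to info()
missing_per_column = customers.isna().sum()

# count/mean/std/min/quartiles/max for the numeric columns
stats = customers.describe()

n_rows, n_cols = customers.shape

customers.info()  # prints dtypes and non-null counts
print(stats)
```

Here `describe()` reports a `count` of 4 for `Age` because missing values are excluded from the statistics.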

## Finance Example
Consider a theoretical finance dataset on stock market performance, with columns for Date, Ticker Symbol, Opening Price, Closing Price, Volume Traded, and Dividend Yield. Using these operations conceptually:

- The head reveals early 2025 entries for stocks like AAPL, showing rising opening prices, which hints at an upward trend worth checking against the full series.
- The tail exposes late 2025 data, highlighting a market dip in closing prices, suggesting volatility.
- Info shows 1,000 rows in which every column is fully non-null except Dividend Yield, which has 950 non-null values (5% missing). Prices are stored as floats. The missing yields signal a need for imputation before yield calculations.
- Describe computes a mean closing price of $150, std dev of $20 (indicating fluctuation), and max volume of 10 million shares, pointing to high-liquidity days for trading strategy development.
- Columns lists the features, enabling selection of 'Closing Price' and 'Volume' for correlation analysis in portfolio diversification.
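A few of these checks can be sketched in code. The `stocks` table below is a four-row stand-in with invented prices, volumes, and yields, and its column names (`Open`, `Close`, `Volume`, `DividendYield`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: values are invented, not real AAPL quotes
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-02", "2025-01-03", "2025-01-06", "2025-01-07"]),
    "Ticker": ["AAPL"] * 4,
    "Open": [150.0, 152.0, 151.0, 155.0],
    "Close": [152.0, 151.0, 155.0, 158.0],
    "Volume": [8_000_000, 6_500_000, 9_000_000, 10_000_000],
    "DividendYield": [0.005, np.nan, 0.005, 0.005],
})

# Fraction of missing dividend yields (what info() hints at)
missing_share = stocks["DividendYield"].isna().mean()

# Summary statistics for the closing price (what describe() reports)
close_stats = stocks["Close"].describe()

# Correlation between two selected columns
close_volume_corr = stocks["Close"].corr(stocks["Volume"])

# One simple imputation choice: fill gaps with the column median
filled_yield = stocks["DividendYield"].fillna(stocks["DividendYield"].median())

print(f"Missing dividend share: {missing_share:.0%}")
print(close_stats)
```

Median imputation is only one option; whether it is appropriate depends on why the yields are missing.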
