# **Day 3 – Python + Pandas Introduction**

Day 3 of the QuantLake Internship

## **Objective**

- Learn the basics of `pandas`, a core Python library for data analytics
- Practice loading, exploring, and manipulating real-world datasets
- Understand key differences between `Series` and `DataFrame`
- Build strong fundamentals for future analysis

## **Section 1: Pandas Basics – Series & DataFrame**

In this section, I create Pandas Series and DataFrames from scratch.
A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional table with labeled rows and columns, similar to an Excel sheet or SQL table.

I also use `.head()`, `.tail()`, `.shape`, `.columns`, `.index`, and `.dtypes` to inspect the DataFrame's structure.
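
One way to see the relationship between the two types is to pull a single column out of a DataFrame: the result is a Series. The sketch below uses a small hypothetical table (`table`, with made-up `Marks` and `Passed` columns) that is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
scores = pd.Series([88, 76, 93], index=["a", "b", "c"], name="Marks")

table = pd.DataFrame(
    {"Marks": [88, 76, 93], "Passed": [True, True, True]},
    index=["a", "b", "c"],
)

# Selecting one column of a DataFrame gives back a Series
column = table["Marks"]
print(type(column).__name__)

# A Series has one dimension, a DataFrame has two
print(scores.ndim, table.ndim)
```

Both objects carry an index (`a`, `b`, `c` here); the DataFrame additionally has column labels.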



In [1]:
import pandas as pd
import numpy as np

In [2]:
# Creating a Pandas Series
marks = pd.Series([88, 76, 93, 85], name="Marks")
print("Series:\n", marks)

Series:
 0    88
1    76
2    93
3    85
Name: Marks, dtype: int64


In [3]:
# Creating a DataFrame from dictionary
data = {
    'Name': ['Sandhya', 'Dev', 'Shiv', 'Lakshmi'],
    'Age': [21, 22, 21, 23],
    'City': ['Indore', 'Ahmedabad', 'Nagpur', 'Vizag']
}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)


DataFrame:
       Name  Age       City
0  Sandhya   21     Indore
1      Dev   22  Ahmedabad
2     Shiv   21     Nagpur
3  Lakshmi   23      Vizag


In [4]:
# Basic inspection
print("\nHead:\n", df.head())
print("\nTail:\n", df.tail())
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("Index:", df.index)
print("Data Types:\n", df.dtypes)


Head:
       Name  Age       City
0  Sandhya   21     Indore
1      Dev   22  Ahmedabad
2     Shiv   21     Nagpur
3  Lakshmi   23      Vizag

Tail:
       Name  Age       City
0  Sandhya   21     Indore
1      Dev   22  Ahmedabad
2     Shiv   21     Nagpur
3  Lakshmi   23      Vizag
Shape: (4, 3)
Columns: ['Name', 'Age', 'City']
Index: RangeIndex(start=0, stop=4, step=1)
Data Types:
 Name    object
Age      int64
City    object
dtype: object
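
The head and tail outputs above are identical because `df` has only four rows, while `.head()` and `.tail()` return five rows by default. Both accept a row count. A minimal sketch, assuming a hypothetical ten-row frame named `numbers`:

```python
import pandas as pd

# Hypothetical frame with ten rows, only for illustration
numbers = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_two = numbers.head(2)  # rows at index 0 and 1
last_two = numbers.tail(2)   # rows at index 8 and 9

print(first_two)
print(last_two)
```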


## **Section 2: Load and View Real Dataset (Iris)**

This section loads a real-world dataset (iris) with `seaborn.load_dataset()` and explores it with Pandas tools such as `.info()`, `.describe()`, and `.isnull().sum()`.

These methods help us understand:

- Dataset size
- Data types
- Missing values
- Statistical summary (mean, std, min, etc.)

In [5]:
import seaborn as sns

In [6]:
# Load the Iris example dataset (seaborn fetches it online on first use, then caches it)
iris = sns.load_dataset('iris')
print(iris.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [7]:
# Basic info
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [8]:
# Descriptive statistics
print(iris.describe())

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
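
By default, `.describe()` on a whole DataFrame summarises only the numeric columns, which is why `species` does not appear in the table above. Calling it on a text column gives a different summary. The sketch below uses a small hypothetical Series (`species_sample`) rather than the full dataset:

```python
import pandas as pd

# Hypothetical five-value sample, only for illustration
species_sample = pd.Series(
    ["setosa", "setosa", "setosa", "virginica", "virginica"],
    name="species",
)

# For text data, describe() reports count, unique, top (most frequent) and freq
text_summary = species_sample.describe()
print(text_summary)
```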


In [9]:
# Check for null values
print("Null values in each column:\n", iris.isnull().sum())

Null values in each column:
 sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
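
Iris has no missing values, so every count above is zero. To see what the same check reports when data is missing, here is a minimal sketch on a hypothetical frame (`readings`, with made-up `temp` and `city` columns):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, only for illustration
readings = pd.DataFrame({
    "temp": [21.5, np.nan, 23.1, np.nan],
    "city": ["Indore", "Nagpur", None, "Vizag"],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = readings.isnull().sum()
print(missing_per_column)
```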


In [10]:
# Row and column count
print("Rows:", iris.shape[0], "| Columns:", iris.shape[1])

Rows: 150 | Columns: 5


## **Section 3: Accessing & Filtering Data**

Here we learn how to access and filter specific parts of the data using:

- `df['column']` and `df[['col1', 'col2']]`
- `.loc[]` (label-based) and `.iloc[]` (integer position-based)

We also:

- Add a new column (e.g. a derived metric)
- Drop unwanted rows or columns using `.drop()`
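
Filtering rows by a condition is usually done with a boolean mask. The following is a minimal sketch that reuses the names and ages from Section 1 in a frame called `people` (a name chosen here for illustration):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Sandhya", "Dev", "Shiv", "Lakshmi"],
    "Age": [21, 22, 21, 23],
})

# Comparing a column to a value yields a boolean Series, one value per row
is_older = people["Age"] > 21

# Indexing with the mask keeps only the rows where it is True
older = people[is_older]
print(older)

# .loc accepts the same mask plus a column label
older_names = people.loc[people["Age"] > 21, "Name"]
print(older_names.tolist())
```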

In [11]:
# Accessing single and multiple columns
print(iris['species'].head())
print(iris[['sepal_length', 'species']].head())

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
   sepal_length species
0           5.1  setosa
1           4.9  setosa
2           4.7  setosa
3           4.6  setosa
4           5.0  setosa


In [12]:
# Using loc and iloc
print("\nUsing loc:\n", iris.loc[0:4, ['sepal_length', 'species']])
print("\nUsing iloc:\n", iris.iloc[0:5, 0:2])


Using loc:
    sepal_length species
0           5.1  setosa
1           4.9  setosa
2           4.7  setosa
3           4.6  setosa
4           5.0  setosa

Using iloc:
    sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0
2           4.7          3.2
3           4.6          3.1
4           5.0          3.6
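
Both calls above return five rows even though the slices are written `0:4` and `0:5`. `.loc` slices by label and includes the end label, whereas `.iloc` slices by integer position and excludes the end. The difference is easier to see when the index is not the default 0, 1, 2, and so on. A sketch with a hypothetical frame (`cities`) indexed by 10, 20, 30, 40:

```python
import pandas as pd

cities = pd.DataFrame(
    {"City": ["Indore", "Ahmedabad", "Nagpur", "Vizag"]},
    index=[10, 20, 30, 40],  # hypothetical non-default labels
)

# Label-based: rows labelled 10 through 30, end label included
by_label = cities.loc[10:30]

# Position-based: rows at positions 0 and 1, end position excluded
by_position = cities.iloc[0:2]

print(by_label)
print(by_position)
```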


In [13]:
# Add a new column
iris['sepal_area'] = iris['sepal_length'] * iris['sepal_width']
print("\nAdded new column 'sepal_area':\n", iris.head())


Added new column 'sepal_area':
    sepal_length  sepal_width  petal_length  petal_width species  sepal_area
0           5.1          3.5           1.4          0.2  setosa       17.85
1           4.9          3.0           1.4          0.2  setosa       14.70
2           4.7          3.2           1.3          0.2  setosa       15.04
3           4.6          3.1           1.5          0.2  setosa       14.26
4           5.0          3.6           1.4          0.2  setosa       18.00


In [14]:
# Drop a column
iris_dropped = iris.drop(columns=['petal_width'])
print("\nDropped 'petal_width':\n", iris_dropped.head())


Dropped 'petal_width':
    sepal_length  sepal_width  petal_length species  sepal_area
0           5.1          3.5           1.4  setosa       17.85
1           4.9          3.0           1.4  setosa       14.70
2           4.7          3.2           1.3  setosa       15.04
3           4.6          3.1           1.5  setosa       14.26
4           5.0          3.6           1.4  setosa       18.00
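
`.drop()` also removes rows when given index labels, and in both cases it returns a new DataFrame rather than changing the original. A minimal sketch, again assuming a small frame named `people` built from the Section 1 data:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Sandhya", "Dev", "Shiv", "Lakshmi"],
    "Age": [21, 22, 21, 23],
})

# Drop the row whose index label is 0
without_first = people.drop(index=[0])

# Drop a column, as in the cell above
without_age = people.drop(columns=["Age"])

# The original frame still has all rows and columns
print(people.shape)
print(without_first.index.tolist())
print(without_age.columns.tolist())
```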


## **Section 4: Pandas Built-in Methods**

In this section, we apply commonly used Pandas methods to summarize and analyze data:

- `.sort_values()`, `.value_counts()`, `.unique()`
- `.mean()`, `.sum()`, `.min()`, `.max()`

These help us perform exploratory data analysis (EDA) and understand key patterns or distributions in the dataset.

In [15]:
# Sorting values
sorted_iris = iris.sort_values(by='sepal_length', ascending=False)
print("Sorted by sepal_length:\n", sorted_iris.head())

Sorted by sepal_length:
      sepal_length  sepal_width  petal_length  petal_width    species  \
131           7.9          3.8           6.4          2.0  virginica   
122           7.7          2.8           6.7          2.0  virginica   
118           7.7          2.6           6.9          2.3  virginica   
117           7.7          3.8           6.7          2.2  virginica   
135           7.7          3.0           6.1          2.3  virginica   

     sepal_area  
131       30.02  
122       21.56  
118       20.02  
117       29.26  
135       23.10  


In [16]:
# Unique values and value counts
print("Unique species:", iris['species'].unique())
print("Species counts:\n", iris['species'].value_counts())

Unique species: ['setosa' 'versicolor' 'virginica']
Species counts:
 species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
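
When the classes are not evenly split, proportions can be easier to read than raw counts. `value_counts()` accepts `normalize=True` for this. A sketch on a hypothetical, deliberately unbalanced sample (`species_sample`):

```python
import pandas as pd

# Hypothetical unbalanced sample, only for illustration
species_sample = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica"],
    name="species",
)

# normalize=True returns each value's share of the total instead of its count
shares = species_sample.value_counts(normalize=True)
print(shares)
```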


In [17]:
# Aggregation functions
print("Mean Sepal Length:", iris['sepal_length'].mean())
print("Total Sepal Area:", iris['sepal_area'].sum())
print("Max Petal Length:", iris['petal_length'].max())

Mean Sepal Length: 5.843333333333334
Total Sepal Area: 2673.4299999999994
Max Petal Length: 6.9
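
The long decimals in the totals above come from floating-point arithmetic, and `round()` tidies them for display. Several aggregations can also be requested in one call with `.agg()`. A minimal sketch using only the first five sepal lengths from the preview in Section 2 (the Series name `lengths` is chosen here for illustration):

```python
import pandas as pd

# First five sepal lengths from the iris preview
lengths = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="sepal_length")

# agg() with a list of names returns one result per aggregation
summary = lengths.agg(["mean", "min", "max"])
print(summary)

# Round a float result for display
total = round(lengths.sum(), 2)
print(total)
```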


## **Summary**

I learned how to use Pandas to handle tabular data effectively.

🧠 Key Concepts Covered:
- Difference between **Series** and **DataFrame**
- How to **inspect**, **filter**, and **modify** a dataset
- How to sort, aggregate, and summarize data with built-in methods
- Practice with a real-world dataset (`iris`) and with dummy data created from scratch

🎯 With this, I now feel confident using Pandas to explore and clean data, which is an early step in most data analytics projects.
