Pandas is built around data structures: containers for organizing data, much like spreadsheets in Python. The two core data structures are:
1. Series: A one-dimensional labeled array, like a single column in a spreadsheet.
2. DataFrame: A two-dimensional table, like an entire spreadsheet with rows and columns.
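
As a minimal sketch of the two structures, the snippet below builds a Series and a DataFrame from in-memory values. The names `masses` and `table` and the values are invented for illustration; they are not read from the penguins file.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (illustrative data)
masses = pd.Series([3750.0, 3800.0, 3250.0], name="body_mass_g")

# A DataFrame: a two-dimensional table; each column is itself a Series
table = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": masses,
})

print(masses.ndim, table.ndim)       # number of dimensions of each structure
print(type(table["body_mass_g"]))    # selecting one column gives back a Series
```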

In [3]:
import pandas as pd
data_path=r"..\Datasets\penguins.csv"  # raw string so the backslashes are not read as escape sequences
penguins=pd.read_csv(data_path)
penguins.head(n=2)

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007


In [4]:
# Select the island column as a Series:
island_series=penguins["island"]
type(island_series)

pandas.core.series.Series

In [5]:
island_series.head()

0    Torgersen
1    Torgersen
2    Torgersen
3    Torgersen
4    Torgersen
Name: island, dtype: object

In [6]:
island_series.dtypes

dtype('O')

In [7]:
island_series.index
# The index runs from 0 to 343 (stop=344 is exclusive).

RangeIndex(start=0, stop=344, step=1)

In [8]:
island_series.describe()
# Only 3 values are unique; the rest are repeats.

count        344
unique         3
top       Biscoe
freq         168
Name: island, dtype: object

In [9]:
# to get all unique values. 
island_series.unique()


array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

In [10]:
# to get count of unique values.
island_series.nunique()

3

In [11]:
# values which are null
island_series.isnull().sum()

#no null values in this column


np.int64(0)

In [12]:
body_mass_series=penguins["body_mass_g"]
body_mass_series.isnull().sum()

np.int64(2)

In [13]:
body_mass_series.describe()

count     342.000000
mean     4201.754386
std       801.954536
min      2700.000000
25%      3550.000000
50%      4050.000000
75%      4750.000000
max      6300.000000
Name: body_mass_g, dtype: float64

In [14]:
# Convert all values to uppercase
island_series.str.upper()

0      TORGERSEN
1      TORGERSEN
2      TORGERSEN
3      TORGERSEN
4      TORGERSEN
         ...    
339        DREAM
340        DREAM
341        DREAM
342        DREAM
343        DREAM
Name: island, Length: 344, dtype: object

# Understanding DataFrames: A spreadsheet in code, a table of rows and columns that holds data for analysis.

In [15]:
# simple example of creating dataframe:
import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)
print(df)

# Load a CSV file
# df = pd.read_csv('data.csv')  # Uncomment to load a real CSV

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   75000


In [16]:
import pandas as pd
penguins_data=pd.read_csv(r"..\Datasets\penguins.csv")  # raw string keeps the backslashes literal
type(penguins_data)

pandas.core.frame.DataFrame

In [17]:
# Display the first five rows for a quick data preview 
penguins_data.head()

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,4,Adelie,Torgersen,,,,,,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [18]:
# Initial dataset exploration: structure and content overview (a revision of what we covered earlier).
penguins.shape 
# Show the dataset’s dimensions as a tuple

(344, 9)

In [19]:
penguins_data.columns #another attribute 
# List all column names to understand available features 

Index(['rowid', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')

In [20]:
# Summarize column types, non-null counts, and memory usage
penguins_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB


In [21]:
# Fill missing values in 'body_mass_g' with zero for cleaner analysis 
penguins_data['body_mass_g'].fillna(0)

0      3750.0
1      3800.0
2      3250.0
3         0.0
4      3450.0
        ...  
339    4000.0
340    3400.0
341    3775.0
342    4100.0
343    3775.0
Name: body_mass_g, Length: 344, dtype: float64
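
Note that `fillna` returns a new Series and leaves the original untouched, so the result has to be assigned to be kept. A small sketch with made-up values (the names `mass` and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing value
mass = pd.Series([3750.0, np.nan, 3250.0], name="body_mass_g")

filled = mass.fillna(0)        # new Series with NaN replaced by 0

print(mass.isnull().sum())     # the original still has its missing value
print(filled.isnull().sum())   # the filled copy has none
```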

In [22]:
# Select multiple columns for demographic analysis
penguins_data[['sex','year']]


Unnamed: 0,sex,year
0,male,2007
1,female,2007
2,female,2007
3,,2007
4,female,2007
...,...,...
339,male,2009
340,female,2009
341,male,2009
342,male,2009


In [23]:
#data types of columns
penguins_data.dtypes

# Strings are typically stored with the object dtype in a DataFrame.
# Integers are int64.

rowid                  int64
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object

In [24]:
penguins_data.describe()
# Compute descriptive statistics for all numeric columns

Unnamed: 0,rowid,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,year
count,344.0,342.0,342.0,342.0,342.0,344.0
mean,172.5,43.92193,17.15117,200.915205,4201.754386,2008.02907
std,99.448479,5.459584,1.974793,14.061714,801.954536,0.818356
min,1.0,32.1,13.1,172.0,2700.0,2007.0
25%,86.75,39.225,15.6,190.0,3550.0,2007.0
50%,172.5,44.45,17.3,197.0,4050.0,2008.0
75%,258.25,48.5,18.7,213.0,4750.0,2009.0
max,344.0,59.6,21.5,231.0,6300.0,2009.0


In [25]:
# Missing data in DataFrames
penguins_data.isnull().sum()
# Count missing values per column to assess data cleanliness

rowid                 0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [26]:
# add new column in dataframes
# Add a new column converting grams to kilograms for body mass
penguins_data["body_mass_kg"]=penguins_data["body_mass_g"]/1000

# Preview the top two rows of the modified dataset 
penguins_data.head(n=2)


Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,3.75
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,3.8


In [27]:
penguins_data.drop("body_mass_g",axis=1) 
# Remove the 'body_mass_g' column (use axis=1 for columns) 


Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,sex,year,body_mass_kg
0,1,Adelie,Torgersen,39.1,18.7,181.0,male,2007,3.750
1,2,Adelie,Torgersen,39.5,17.4,186.0,female,2007,3.800
2,3,Adelie,Torgersen,40.3,18.0,195.0,female,2007,3.250
3,4,Adelie,Torgersen,,,,,2007,
4,5,Adelie,Torgersen,36.7,19.3,193.0,female,2007,3.450
...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,55.8,19.8,207.0,male,2009,4.000
340,341,Chinstrap,Dream,43.5,18.1,202.0,female,2009,3.400
341,342,Chinstrap,Dream,49.6,18.2,193.0,male,2009,3.775
342,343,Chinstrap,Dream,50.8,19.0,210.0,male,2009,4.100
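
Like `fillna`, `drop` returns a new DataFrame rather than changing the one it is called on. The sketch below, on a tiny invented frame, shows that the column only disappears once the result is assigned back.

```python
import pandas as pd

# Illustrative two-column frame
df = pd.DataFrame({"body_mass_g": [3750.0, 3800.0], "body_mass_kg": [3.75, 3.8]})

df.drop("body_mass_g", axis=1)         # result is discarded; df is unchanged
columns_before = list(df.columns)

df = df.drop(columns="body_mass_g")    # reassign to keep the change
columns_after = list(df.columns)

print(columns_before)
print(columns_after)
```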


In [28]:
# Final Notebook: DataFrames with the Palmer Penguins dataset.
import pandas as pd
penguins_data=pd.read_csv(r"..\Datasets\penguins.csv")  # raw string keeps the backslashes literal
print("first two rows of penguins Dataset")
print(penguins_data.head(n=2))

# 2. Accessing and exploring data
# View basic info (info() prints its report directly and returns None)
print("\nDataset info:")
penguins_data.info()

# View dimensions (rows, columns)
print("\nShape:", penguins_data.shape)

# Summary statistics for numeric columns
print("\nSummary statistics:")
print(penguins_data.describe())

# Select specific columns
print("\nSpecies and Body mass")
print(penguins_data[['species','body_mass_g']].head())


# Add a new column: flipper length in cm
penguins_data["flipper_length_cm"]=penguins_data["flipper_length_mm"]/10
print("\nwith Flipper length in cm")
print(penguins_data[["flipper_length_cm","flipper_length_mm"]].head())


# Drop the new column
penguins_data=penguins_data.drop("flipper_length_cm",axis=1)
print(penguins_data.head())

# Drop rows with any remaining missing values
penguins_data_clean=penguins_data.dropna()
print("\nShape after dropping rows with missing values:", penguins_data_clean.shape)

first two rows of penguins Dataset
   rowid species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0      1  Adelie  Torgersen            39.1           18.7              181.0   
1      2  Adelie  Torgersen            39.5           17.4              186.0   

   body_mass_g     sex  year  
0       3750.0    male  2007  
1       3800.0  female  2007  

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 n

# Access Data with loc (label-based indexing)
`loc` lets us select rows and columns from a DataFrame using labels (e.g., row indices, column names) or conditions. It is a way to point directly at the parts of a DataFrame we want, using human-readable names or logical filters.
1. Key idea: loc uses labels (row indices, column names) or boolean conditions to access data, unlike iloc, which uses numerical positions.
2. Syntax: df.loc[row_selection, column_selection]

loc lets us work with meaningful labels and conditions, making code readable and flexible.
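
The syntax above can be sketched on a small frame with human-readable row labels. The labels `p1` to `p3` and the values are hypothetical, chosen only to show label-based selection.

```python
import pandas as pd

# Illustrative frame with string row labels (hypothetical data)
df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap"],
     "body_mass_g": [3750.0, 5000.0, 3800.0]},
    index=["p1", "p2", "p3"],
)

one_value = df.loc["p2", "body_mass_g"]               # single cell by row and column label
label_slice = df.loc["p1":"p2", ["species"]]          # label slices include both ends
heavy = df.loc[df["body_mass_g"] > 4000, "species"]   # rows by condition, column by name

print(one_value)
print(label_slice)
print(heavy.tolist())
```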

In [29]:
import pandas as pd
penguins_data=pd.read_csv(r"..\Datasets\penguins.csv")  # raw string keeps the backslashes literal
penguins_selection=penguins_data.loc[0:4, ["rowid","species","sex"]]
penguins_selection

# .loc can also filter data based on conditions

Unnamed: 0,rowid,species,sex
0,1,Adelie,male
1,2,Adelie,female
2,3,Adelie,female
3,4,Adelie,
4,5,Adelie,female


In [30]:
# Find only those whose body mass is greater than 4000
heavy=penguins_data.loc[penguins_data["body_mass_g"]>4000]
print(heavy.head(n=4))

# Without head(), we would find that only 177 rows have body_mass_g > 4000.

    rowid species     island  bill_length_mm  bill_depth_mm  \
7       8  Adelie  Torgersen            39.2           19.6   
9      10  Adelie  Torgersen            42.0           20.2   
14     15  Adelie  Torgersen            34.6           21.1   
17     18  Adelie  Torgersen            42.5           20.7   

    flipper_length_mm  body_mass_g   sex  year  
7               195.0       4675.0  male  2007  
9               190.0       4250.0   NaN  2007  
14              198.0       4400.0  male  2007  
17              197.0       4500.0  male  2007  


In [31]:
# Refine selection by specifying the column that we want to include
heavier=penguins_data.loc[penguins_data["body_mass_g"]>4000,
                        ["species","rowid"]
                        ]
heavier

Unnamed: 0,species,rowid
7,Adelie,8
9,Adelie,10
14,Adelie,15
17,Adelie,18
19,Adelie,20
...,...,...
321,Chinstrap,322
323,Chinstrap,324
329,Chinstrap,330
333,Chinstrap,334


In [32]:
# The comparisons are combined with a boolean operator, so each condition must be wrapped in ().
heavier=penguins_data.loc[(penguins_data["body_mass_g"]>4000) & (penguins_data["flipper_length_mm"]>195.0),
                        ["species","rowid"]
                        ]
heavier

# Out of 344 rows, only 151 met both conditions.

Unnamed: 0,species,rowid
14,Adelie,15
17,Adelie,18
35,Adelie,36
43,Adelie,44
53,Adelie,54
...,...,...
321,Chinstrap,322
323,Chinstrap,324
329,Chinstrap,330
333,Chinstrap,334
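
The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`. A minimal sketch on invented values, combining conditions with both "and" (`&`) and "or" (`|`):

```python
import pandas as pd

# Illustrative measurements (not taken from the dataset)
df = pd.DataFrame({
    "body_mass_g": [3500.0, 4200.0, 4500.0],
    "flipper_length_mm": [190.0, 193.0, 210.0],
})

# Each comparison is wrapped in () before being combined
both = df.loc[(df["body_mass_g"] > 4000) & (df["flipper_length_mm"] > 195.0)]
either = df.loc[(df["body_mass_g"] > 4000) | (df["flipper_length_mm"] > 195.0)]

print(len(both), len(either))
```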


In [33]:
# Update specific rows and columns
# Set 'sex' to 'Male' where the value is missing (NaN)
# Note: existing values are lowercase ('male'/'female'), so 'Male' becomes a separate category.
penguins_data.loc[penguins_data["sex"].isnull(),"sex"]="Male"
penguins_data

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,4,Adelie,Torgersen,,,,,Male,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,341,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,342,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,343,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


In [34]:
# Add 1000 g to body_mass_g for female penguins
penguins_data.loc[penguins_data["sex"]=="female","body_mass_g"]+=1000
penguins_data

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,4800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,4250.0,female,2007
3,4,Adelie,Torgersen,,,,,Male,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,4450.0,female,2007
...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,341,Chinstrap,Dream,43.5,18.1,202.0,4400.0,female,2009
341,342,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,343,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


# **Takeaway**: With `loc`, we can select and modify data using intuitive labels and conditions.
Core Uses:

1. Select rows: df.loc[0] or df.loc[df['species'] == 'Gentoo'].
2. Select columns: df.loc[:, 'body_mass_g'] or df.loc[:, ['species', 'body_mass_g']].
3. Modify data: df.loc[condition, 'column'] = new_value.
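
The three core uses can be sketched together on a small invented frame. The Gentoo rows and their masses are illustrative values, not rows from the file.

```python
import pandas as pd

# Illustrative frame with the default 0-based index
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, 5000.0, 5400.0],
})

row0 = df.loc[0]                               # 1. one row by label, returned as a Series
gentoo = df.loc[df["species"] == "Gentoo"]     # 1. rows by condition
two_cols = df.loc[:, ["species", "body_mass_g"]]   # 2. all rows, chosen columns

# 3. modify only the rows that match the condition
df.loc[df["species"] == "Gentoo", "body_mass_g"] += 100

print(df)
```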

# **Access Data with iloc**: integer-location-based indexing method in pandas.
It lets us select rows and columns from a DataFrame using numerical positions. Where loc is about meaningful names and conditions, iloc is about raw positions ("give me the first row and third column"). It is like navigating a spreadsheet by counting rows and columns starting from zero.
<br>

1. Key idea: iloc uses integer positions (0-based) to access data, ignoring labels or names.
2. Syntax: df.iloc[row_selection, column_selection]
<br>

iloc is perfect when we need to access data by position, especially when labels are complex, missing, or irrelevant, or when we are iterating over a DataFrame.
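
A minimal sketch of position-based access, assuming a small invented frame whose row labels (`p1` to `p3`) are deliberately not integers, to show that iloc ignores them:

```python
import pandas as pd

# Illustrative frame; iloc never looks at these row labels
df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap"],
     "body_mass_g": [3750.0, 5000.0, 3800.0]},
    index=["p1", "p2", "p3"],
)

first_cell = df.iloc[0, 1]       # first row, second column
first_two = df.iloc[0:2, 0:1]    # position slices exclude the end position
last_row = df.iloc[-1]           # negative positions count from the end

print(first_cell)
print(first_two)
print(last_row["species"])
```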

In [35]:
import pandas as pd
penguins_data=pd.read_csv(r"..\Datasets\penguins.csv")  # raw string keeps the backslashes literal
#select the first column. 
penguins_data.iloc[:,0].head()

0    1
1    2
2    3
3    4
4    5
Name: rowid, dtype: int64

In [36]:
# select multiple columns (indices 0 and 5)
penguins_data.iloc[:,[0,5]].head()

# Note: The : in df.iloc[:, 0] means “select all rows.”

Unnamed: 0,rowid,flipper_length_mm
0,1,181.0
1,2,186.0
2,3,195.0
3,4,
4,5,193.0


In [37]:
# selecting subsets: combine row and columns to grab a specific subset
# Select first 3 rows and first 2 columns (rowid, species)
penguins_data.iloc[0:3,0:2]

Unnamed: 0,rowid,species
0,1,Adelie
1,2,Adelie
2,3,Adelie


In [38]:
# Select specific rows and columns (rows 1, 3 and columns 0, 5)
penguins_data.iloc[[1,3],[0,5]]
# Why it matters: Selecting subsets by position is great for quick data extraction, especially when column names are complex or we are scripting.

Unnamed: 0,rowid,flipper_length_mm
1,2,186.0
3,4,


In [39]:
# modifying Data with iloc
penguins_data.iloc[0,6]=4000
penguins_data.iloc[0]

rowid                        1
species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             4000.0
sex                       male
year                      2007
Name: 0, dtype: object

In [40]:
# Increase body_mass_g for first 3 rows by 100
penguins_data.iloc[0:3,6]+=100
penguins_data.iloc[0:3,[0,6]]

Unnamed: 0,rowid,body_mass_g
0,1,4100.0
1,2,3900.0
2,3,3350.0


In [42]:
# Key differences between the iloc and loc methods in pandas.

import pandas as pd
penguins=pd.read_csv(r"..\Datasets\penguins.csv")  # raw string keeps the backslashes literal
penguins.loc[0:4,["rowid","body_mass_g","sex"]]

Unnamed: 0,rowid,body_mass_g,sex
0,1,3750.0,male
1,2,3800.0,female
2,3,3250.0,female
3,4,,
4,5,3450.0,female


In [44]:
penguins.iloc[0:3,1:4]

Unnamed: 0,species,island,bill_length_mm
0,Adelie,Torgersen,39.1
1,Adelie,Torgersen,39.5
2,Adelie,Torgersen,40.3


In [49]:
# Build the condition from the same DataFrame that is being filtered
heavy_loc=penguins.loc[penguins["body_mass_g"]>3000, ["rowid","species","body_mass_g"]]
heavy_loc

Unnamed: 0,rowid,species,body_mass_g
0,1,Adelie,3750.0
1,2,Adelie,3800.0
2,3,Adelie,3250.0
4,5,Adelie,3450.0
5,6,Adelie,3650.0
...,...,...,...
339,340,Chinstrap,4000.0
340,341,Chinstrap,3400.0
341,342,Chinstrap,3775.0
342,343,Chinstrap,4100.0


1. Mixing up labels and positions: If your index is numeric, loc[0] and iloc[0] might seem similar but can differ if the index isn’t 0-based or sequential.
2. Out-of-bounds errors: iloc raises an IndexError if you use a position beyond the DataFrame’s size, while loc raises a KeyError for nonexistent labels.
3. loc is label-based: It uses row and column labels (names). For example, if your DataFrame’s rows are indexed by names (e.g., "Alice", "Bob") and columns are labeled (e.g., "Age", "Score"), you use loc with those labels, like df.loc["Alice", "Age"] to get Alice’s age.
4. iloc is integer-based: It uses integer positions (0-based, like array indexing in Python). So, df.iloc[0, 1] grabs the value in the first row, second column, regardless of labels.
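
The first two points can be illustrated with a frame whose index is numeric but neither 0-based nor sequential. The index values and ages below are invented for this sketch.

```python
import pandas as pd

# Numeric index that is out of order on purpose (illustrative)
df = pd.DataFrame({"Age": [25, 30, 35]}, index=[10, 20, 0])

by_label = df.loc[0, "Age"]      # the row whose label is 0 (the last row here)
by_position = df.iloc[0, 0]      # the first row by position

iloc_error = None
loc_error = None

try:
    df.iloc[5]                   # only 3 rows exist, so position 5 is out of bounds
except IndexError as exc:
    iloc_error = type(exc).__name__

try:
    df.loc[99]                   # no row carries the label 99
except KeyError as exc:
    loc_error = type(exc).__name__

print(by_label, by_position, iloc_error, loc_error)
```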