# <center> Pandas Tutorial </center>

🔹 Introduction to Pandas

Pandas is a Python library widely used for data manipulation and analysis.
It provides two main data structures:

Series → 1D labeled array (like a single column).

DataFrame → 2D labeled table (like an Excel sheet).

In [1]:
# Install pandas (if not installed already)
!pip install pandas

# Import pandas with alias
import pandas as pd



🔹 1. Pandas Series

A Series is a one-dimensional labeled array.

It can hold data of any type: integers, floats, strings, etc.

Think of it as one column of data in a spreadsheet.

In [2]:
import pandas as pd

# Creating a Series from a Python list
scores = [85, 90, 78, 92]
series_scores = pd.Series(scores)

print(series_scores)

0    85
1    90
2    78
3    92
dtype: int64


👉 The left side (0, 1, 2, 3) is the index.

👉 The right side (85, 90, 78, 92) is the data.

✅ Custom Index in Series

We can provide custom labels instead of the default numeric index.

In [3]:
marks = pd.Series([85, 90, 78, 92], index=["Math", "Science", "History", "English"])
print(marks)

Math       85
Science    90
History    78
English    92
dtype: int64
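
A Series can also be built from a dictionary, in which case the keys become the index labels. The sketch below reuses the marks from the example above; the variable name `marks_by_subject` is only illustrative.

```python
import pandas as pd

# Illustrative data: subject -> mark, same values as the example above
marks_by_subject = {"Math": 85, "Science": 90, "History": 78, "English": 92}

# The dictionary keys become the index labels; the values become the data
marks = pd.Series(marks_by_subject)

print(marks.index.tolist())    # the labels
print(marks.values.tolist())   # the data
print(marks["Science"])        # look up one value by its label
```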


🔹 2. Pandas DataFrame

A DataFrame is a two-dimensional table of rows and columns.

It's like a spreadsheet or SQL table.

Each column in a DataFrame is actually a Series.

In [4]:
# Creating a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "Marks": [85, 90, 78]
}

df = pd.DataFrame(data)
print(df)

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
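
To see that each column really is a Series, one can pull a column out and inspect it. This is a small sketch that rebuilds the same DataFrame; `age_column` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "Marks": [85, 90, 78],
})

# Selecting one column returns a Series
age_column = df["Age"]

print(type(age_column).__name__)           # Series
print(df.shape)                            # (3, 3): 3 rows, 3 columns
print(age_column.index.equals(df.index))   # True: the column shares the row index
```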


🔹 Series vs DataFrame: Key Differences

| Feature   | **Series**              | **DataFrame**                  |
| --------- | ----------------------- | ------------------------------ |
| Dimension | 1D                      | 2D                             |
| Structure | Single column           | Rows × Columns (table)         |
| Indexing  | Single axis (index)     | Two axes (rows + columns)      |
| Example   | Student's marks in Math | Student info: Name, Age, Marks |
| Analogy   | A column in Excel       | A whole Excel sheet            |


**🔹 Accessing Data**

✅ Accessing Elements in Series

In [5]:
marks = pd.Series([85, 90, 78, 92], index=["Math", "Science", "History", "English"])
print(marks)
print("Math Marks:", marks["Math"])       # by label
print("First Element:", marks.iloc[0])    # by position (marks[0] is deprecated on a labeled Series)


Math       85
Science    90
History    78
English    92
dtype: int64
Math Marks: 85
First Element: 85


✅ Accessing Data in DataFrame

In [6]:
# Creating a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "Marks": [85, 90, 78]
}

df = pd.DataFrame(data)
# print(df)


# Access a column (returns a Series)
print(df["Name"])

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


In [7]:
# Access multiple columns
print(df[["Name", "Marks"]])

      Name  Marks
0    Alice     85
1      Bob     90
2  Charlie     78


In [8]:
print(df)
# Access a row using index
print(df.loc[0])   # By label

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
Name     Alice
Age         25
Marks       85
Name: 0, dtype: object


In [9]:
print(df)

print(df.iloc[0])  # By position

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
Name     Alice
Age         25
Marks       85
Name: 0, dtype: object


📌 Difference Between .loc and .iloc

| Feature        | `.loc`                                  | `.iloc`                          |
| -------------- | --------------------------------------- | -------------------------------- |
| **Index type** | Label-based indexing                    | Integer position-based indexing  |
| **Usage**      | Use row **labels** and column **names** | Use row and column **positions** |
| **Inclusive**  | End index is **inclusive**              | End index is **exclusive**       |
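
The inclusive/exclusive difference in the last row of the table is easy to check. The sketch below assumes a small DataFrame with illustrative row labels `a`, `b`, `c`, so that labels and positions cannot be confused.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22], "Marks": [85, 90, 78]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b"]    # label slice: the end label "b" is included
by_position = df.iloc[0:1]    # position slice: stops before position 1

print(by_label["Name"].tolist())      # ['Alice', 'Bob']
print(by_position["Name"].tolist())   # ['Alice']
```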


🧠 .loc[]: Label-based indexing

It uses row labels and column names.

✅ Examples:
1. Get the row where the index label is 1:

In [10]:
print(df)
df.loc[1]

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78


Name     Bob
Age       30
Marks     90
Name: 1, dtype: object


2. Get rows 0 and 2:

In [11]:
print(df)
df.loc[[0, 2]]

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78


      Name  Age  Marks
0    Alice   25     85
2  Charlie   22     78


In [12]:
print(df)
# Create DataFrame with custom row labels
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
      Name  Age  Marks
a    Alice   25     85
b      Bob   30     90
c  Charlie   22     78


In [13]:
# ✅ Access the row with label 'b':
# df.loc['b']    # a single label returns the row as a Series (try it)
df.loc[['b']]    # a list of labels returns a one-row DataFrame

   Name  Age  Marks
b   Bob   30     90
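
The difference between a single label and a list of labels can be confirmed by checking the returned types. This is a minimal sketch on the same data; `row_as_series` and `row_as_frame` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22], "Marks": [85, 90, 78]},
    index=["a", "b", "c"],
)

row_as_series = df.loc["b"]     # single label -> Series (column names become its index)
row_as_frame = df.loc[["b"]]    # list of labels -> DataFrame with one row

print(type(row_as_series).__name__)   # Series
print(type(row_as_frame).__name__)    # DataFrame
print(row_as_frame.shape)             # (1, 3)
```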


🧠 .iloc[]: Integer position-based indexing

It uses integer positions, similar to how Python lists work.

✅ Examples:
1. Get the first row (position 0):

In [14]:
df.iloc[0]

Name     Alice
Age         25
Marks       85
Name: a, dtype: object


2. Get the first two rows:

In [15]:
df.iloc[0:2]  # end is exclusive


    Name  Age  Marks
a  Alice   25     85
b    Bob   30     90


3. Get the second column (Age) for first and third row:

In [16]:
# Get the second column (Age) for first and third row:
print(df)
df.iloc[[0, 2], 1]

      Name  Age  Marks
a    Alice   25     85
b      Bob   30     90
c  Charlie   22     78


a    25
c    22
Name: Age, dtype: int64


🆚 Summary Table

| What you want to do                     | Use    | Example                          |
| --------------------------------------- | ------ | -------------------------------- |
| Get data by label (e.g., row index = 1) | `loc`  | `df.loc[1]`                      |
| Get data by position (e.g., first row)  | `iloc` | `df.iloc[0]`                     |
| Slice by row label and column name      | `loc`  | `df.loc[0:2, ['Name', 'Marks']]` |
| Slice by row/column number (position)   | `iloc` | `df.iloc[0:3, 1:3]`              |


🔹 Converting Series → DataFrame

Sometimes we want to convert a Series into a DataFrame.

In [17]:
marks = pd.Series([85, 90, 78, 92], index=["Math", "Science", "History", "English"])
print(marks)
print(type(marks))

Math       85
Science    90
History    78
English    92
dtype: int64
<class 'pandas.core.series.Series'>


In [18]:
# Convert Series to DataFrame
marks_df = marks.to_frame(name="Marks")
print(marks_df)
print(type(marks_df))

         Marks
Math        85
Science     90
History     78
English     92
<class 'pandas.core.frame.DataFrame'>
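
With to_frame() the subject names stay in the index. If they should become an ordinary column instead, one option is reset_index(). The sketch below assumes the column names `Subject` and `Marks`, which are chosen here only for illustration.

```python
import pandas as pd

marks = pd.Series([85, 90, 78, 92], index=["Math", "Science", "History", "English"])

# Name the index, then move it into a regular column next to the values
marks_table = marks.rename_axis("Subject").reset_index(name="Marks")

print(marks_table.columns.tolist())   # ['Subject', 'Marks']
print(marks_table.shape)              # (4, 2)
```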


🔹 Real-Life Example

Imagine student records:

Series → One subject's marks for all students.

DataFrame → Complete student report with name, age, and marks in each subject.

In [19]:
# Series example
math_marks = pd.Series([85, 90, 78], index=["Alice", "Bob", "Charlie"])
print("Math Marks (Series):\n", math_marks)

Math Marks (Series):
 Alice      85
Bob        90
Charlie    78
dtype: int64


In [20]:
# DataFrame example
student_data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "Math": [85, 90, 78],
    "Science": [88, 92, 80]

}
df_students = pd.DataFrame(student_data)
print("\nStudent Data (DataFrame):\n", df_students)


Student Data (DataFrame):
       Name  Age  Math  Science
0    Alice   25    85       88
1      Bob   30    90       92
2  Charlie   22    78       80


### 🔹 Summary

- Series = 1D labeled array (like a single column).

- DataFrame = 2D table of rows and columns (like a full dataset).

- A DataFrame can be seen as several Series (its columns) that share the same index.

- Use Series for one column of data, DataFrame for complete datasets.

# Filtering Data in Pandas

In [21]:
import pandas as pd

# Sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "Marks": [85, 90, 78, 92]
}

df = pd.DataFrame(data)

# Select single column
# print(df["Name"])
print(df)

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
3    David   28     92


🔹 Filtering Data (Applying Conditions)

Filtering lets us extract rows that satisfy certain conditions.

✅ Example 1: Students older than 25

In [22]:
print(df)
print(df[df["Age"] > 25])

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
3    David   28     92
    Name  Age  Marks
1    Bob   30     90
3  David   28     92
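
It may help to split the filter into two steps: the comparison first produces a Series of True/False values (a boolean mask), and indexing with that mask keeps the rows marked True. A small sketch, where `mask` and `older` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "Marks": [85, 90, 78, 92],
})

# Step 1: the comparison yields one True/False value per row
mask = df["Age"] > 25
print(mask.tolist())             # [False, True, False, True]

# Step 2: indexing with the mask keeps only the rows marked True
older = df[mask]
print(older["Name"].tolist())    # ['Bob', 'David']
```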


✅ Example 2: Students with Marks ≥ 90

In [23]:
print(df[df["Marks"] >= 90])

    Name  Age  Marks
1    Bob   30     90
3  David   28     92


✅ Example 3: Combining Multiple Conditions

Use the & (AND) and | (OR) operators.

👉 Remember to wrap each condition in parentheses, because & and | bind more tightly than comparisons such as > in Python.

In [24]:
# Students older than 25 AND Marks above 85
print(df[(df["Age"] > 25) & (df["Marks"] > 85)])


    Name  Age  Marks
1    Bob   30     90
3  David   28     92
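
The cell above only shows AND. As a complement, here is a sketch of OR with `|` and of negation with `~`, on the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "Marks": [85, 90, 78, 92],
})

# OR: older than 25, or marks below 80
either = df[(df["Age"] > 25) | (df["Marks"] < 80)]
print(either["Name"].tolist())       # ['Bob', 'Charlie', 'David']

# NOT: ~ flips the mask, keeping students who are NOT older than 25
not_older = df[~(df["Age"] > 25)]
print(not_older["Name"].tolist())    # ['Alice', 'Charlie']
```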


✅ Example 4: Filtering by String Matching

🔍 .str.startswith("A") is case-sensitive.

This means:

✅ it matches "Alice"

❌ it does not match "alice"

In [25]:
print(df)
# Names starting with 'A'
print(df[df["Name"].str.startswith("A")])

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
3    David   28     92
    Name  Age  Marks
0  Alice   25     85


✅ To make it case-insensitive, convert the names to lowercase first with .str.lower():

In [26]:
df[df["Name"].str.lower().str.startswith("a")]

    Name  Age  Marks
0  Alice   25     85
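
One practical detail: if the column contains missing names, the string methods return a missing value for those rows, and such a mask cannot be used for filtering directly. Passing `na=False` treats them as non-matches. The sketch below uses a hypothetical `people` DataFrame with one missing name.

```python
import pandas as pd

# Hypothetical data with a lowercase name and a missing name
people = pd.DataFrame({"Name": ["Alice", "alice", None, "Bob"]})

# Lowercase first for a case-insensitive match; na=False marks missing names as False
mask = people["Name"].str.lower().str.startswith("a", na=False)

print(mask.tolist())                    # [True, True, False, False]
print(people[mask]["Name"].tolist())    # ['Alice', 'alice']
```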


✅ Example 5: Filtering with .isin()

Check whether a column's values are in a list.

In [27]:
# Select students whose names are Alice or David
print(df[df["Name"].isin(["Alice", "David"])])

    Name  Age  Marks
0  Alice   25     85
3  David   28     92


### 🔹 Summary of Selection/Filtering

- Column Selection → df["col"] or df[["col1","col2"]]

- Row Selection → .loc[] (label) and .iloc[] (position)

- Filtering → df[condition]

- Combine conditions with & (AND), | (OR)

- Use .str for string conditions

- Use .isin() for multiple values

# Manipulating Tabular Data in Pandas

🔹 Step 1: Create a Sample DataFrame

In [28]:
import pandas as pd

# Sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "Marks": [85, 90, 78, 92]
}
df = pd.DataFrame(data)
print(df)

      Name  Age  Marks
0    Alice   25     85
1      Bob   30     90
2  Charlie   22     78
3    David   28     92


🔹 Step 2: Adding New Columns

You can create new columns using calculations or external data.

In [29]:
# Add a new column: Grade
df["Grade"] = ["B", "A", "C", "A"]

# Add a column based on condition
df["Pass"] = df["Marks"] >= 80

print(df)

      Name  Age  Marks Grade   Pass
0    Alice   25     85     B   True
1      Bob   30     90     A   True
2  Charlie   22     78     C  False
3    David   28     92     A   True
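
A column computed from existing columns can also be added with assign(), which returns a new DataFrame and leaves the original untouched. A minimal sketch; the column name `HalfMarks` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Marks": [85, 90, 78, 92],
})

# assign() returns a new DataFrame; df itself is left unchanged
with_half = df.assign(HalfMarks=df["Marks"] / 2)   # HalfMarks is a hypothetical column

print(with_half["HalfMarks"].tolist())   # [42.5, 45.0, 39.0, 46.0]
print("HalfMarks" in df.columns)         # False
```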


🔹 Step 3: Updating Values

In [30]:
# Update a single value
# .at[row_label, column_name] gives fast access to a single value.
# It is label-based only; use .iat for integer positions.
# For a single cell it is faster than .loc.
df.at[2, "Marks"] = 72   # Charlie's marks updated
print(df)

      Name  Age  Marks Grade   Pass
0    Alice   25     85     B   True
1      Bob   30     90     A   True
2  Charlie   22     72     C  False
3    David   28     92     A   True


In [31]:
# Update entire column (increase all ages by 1)
df["Age"] = df["Age"] + 1
print(df)

      Name  Age  Marks Grade   Pass
0    Alice   26     85     B   True
1      Bob   31     90     A   True
2  Charlie   23     72     C  False
3    David   29     92     A   True


🔹 Step 4: Removing Columns and Rows

In [32]:
# Drop a column
# .drop() removes rows or columns and returns a new DataFrame.
df = df.drop("Grade", axis=1)   # axis=1 means columns; with the default axis=0, pandas would look for a row labelled "Grade"

# Drop a row (by index label)
df = df.drop(0)   # removes Alice's row
print(df)

      Name  Age  Marks   Pass
1      Bob   31     90   True
2  Charlie   23     72  False
3    David   29     92   True
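
As an alternative to remembering the axis number, drop() also accepts the `columns=` and `index=` keywords. The sketch below rebuilds a small version of the DataFrame; `trimmed` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Marks": [85, 90, 72, 92],
    "Grade": ["B", "A", "C", "A"],
})

# columns= and index= say what is being dropped, so no axis argument is needed
trimmed = df.drop(columns=["Grade"]).drop(index=[0])

print(trimmed.columns.tolist())   # ['Name', 'Marks']
print(trimmed.index.tolist())     # [1, 2, 3]
print(len(df))                    # 4: the original DataFrame is unchanged
```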


🔹 Step 5: Renaming Columns

In [33]:
df = df.rename(columns={"Marks": "Score"})   # rename the "Marks" column to "Score"
print(df)

      Name  Age  Score   Pass
1      Bob   31     90   True
2  Charlie   23     72  False
3    David   29     92   True


🔹 Step 6: Sorting Data

In [34]:
# Sort by Score (ascending)
# print(df.sort_values(by="Score"))

# Sort by Age (descending):
# sort_values() sorts the DataFrame by the given column; ascending=False reverses the order.
print(df.sort_values(by="Age", ascending=False))

      Name  Age  Score   Pass
1      Bob   31     90   True
3    David   29     92   True
2  Charlie   23     72  False
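
sort_values() also accepts a list of columns, which is useful for breaking ties. The sketch below uses a hypothetical `results` table in which two students share the same score; it sorts by Score descending, then by Name ascending.

```python
import pandas as pd

# Hypothetical data: Bob and Eva have the same score
results = pd.DataFrame({
    "Name": ["Bob", "Charlie", "David", "Eva"],
    "Score": [90, 72, 92, 90],
})

# Highest score first; ties are ordered alphabetically by name
ordered = results.sort_values(by=["Score", "Name"], ascending=[False, True])

print(ordered["Name"].tolist())   # ['David', 'Bob', 'Eva', 'Charlie']
```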


🔹 Step 7: Grouping Data

Grouping helps summarize data.

In [35]:
# Group by Pass status and calculate average score
grouped = df.groupby("Pass")["Score"].mean()
print(grouped)

Pass
False    72.0
True     91.0
Name: Score, dtype: float64


🔹 Step 8: Aggregations

In [36]:
# print("Mean Age:", df["Age"].mean())
# print("Max Score:", df["Score"].max())
# print("Min Score:", df["Score"].min())
print("Summary:\n", df.describe())

Summary:
              Age      Score
count   3.000000   3.000000
mean   27.666667  84.666667
std     4.163332  11.015141
min    23.000000  72.000000
25%    26.000000  81.000000
50%    29.000000  90.000000
75%    30.000000  91.000000
max    31.000000  92.000000


🔹 Step 9: Handling Missing Data

In [37]:
# Introduce a missing value
df.loc[1, "Score"] = None
print(df)

# Fill missing values with the column average.
# Assign the result back instead of calling df["Score"].fillna(..., inplace=True):
# that chained inplace call works on a copy and stops updating df in pandas 3.0.
df["Score"] = df["Score"].fillna(df["Score"].mean())
print(df)

# Drop rows that still contain missing values (none remain here)
df = df.dropna()

      Name  Age  Score   Pass
1      Bob   31    NaN   True
2  Charlie   23   72.0  False
3    David   29   92.0   True
      Name  Age  Score   Pass
1      Bob   31   82.0   True
2  Charlie   23   72.0  False
3    David   29   92.0   True
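
Before filling or dropping, it is often useful to count how many values are missing. The sketch below rebuilds the three remaining rows with one missing score and uses isna().sum(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "David"],
    "Score": [np.nan, 72.0, 92.0],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column["Score"])   # 1

# mean() skips missing values, so the fill value is (72 + 92) / 2
filled = df["Score"].fillna(df["Score"].mean())
print(filled.tolist())               # [82.0, 72.0, 92.0]
```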


🔹 Summary of Data Manipulation

Add column → df["new"] = ...

Update value → df.at[row, "col"] = ...

Drop column/row → df.drop()

Rename → df.rename()

Sort → df.sort_values()

Group/Aggregate → df.groupby() + mean(), sum(), etc.

Handle missing data → fillna(), dropna()

# Summarizing Data in Pandas: groupby & describe

🔹 Step 1: Create a Sample DataFrame

In [38]:
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Department": ["HR", "IT", "HR", "Finance", "IT"],
    "Salary": [50000, 60000, 52000, 58000, 62000],
    "Experience": [2, 5, 3, 7, 4]
}

df = pd.DataFrame(data)
print(df)

      Name Department  Salary  Experience
0    Alice         HR   50000           2
1      Bob         IT   60000           5
2  Charlie         HR   52000           3
3    David    Finance   58000           7
4      Eva         IT   62000           4


🔹 Step 2: Using groupby()

groupby() groups rows that share a value in a column, so that an aggregation (mean, sum, count, etc.) can be applied to each group.

✅ Example 1: Average Salary by Department

In [39]:
avg_salary = df.groupby("Department")["Salary"].mean()
print(avg_salary)

Department
Finance    58000.0
HR         51000.0
IT         61000.0
Name: Salary, dtype: float64


✅ Example 2: Multiple Aggregations

In [40]:
summary = df.groupby("Department").agg({
    "Salary": ["mean", "max", "min"],
    "Experience": "mean"
})
print(summary)

             Salary               Experience
               mean    max    min       mean
Department                                  
Finance     58000.0  58000  58000        7.0
HR          51000.0  52000  50000        2.5
IT          61000.0  62000  60000        4.5
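
The dictionary form above produces two-level column headers. If flat column names are preferred, pandas also supports named aggregation, where each output column is given as `name=(column, function)`. A sketch on the same data; the output names such as `avg_salary` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Department": ["HR", "IT", "HR", "Finance", "IT"],
    "Salary": [50000, 60000, 52000, 58000, 62000],
    "Experience": [2, 5, 3, 7, 4],
})

# Each keyword becomes one flat output column: name=(source column, aggregation)
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    avg_experience=("Experience", "mean"),
)

print(summary.columns.tolist())          # ['avg_salary', 'max_salary', 'avg_experience']
print(summary.loc["HR", "avg_salary"])   # 51000.0
```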


🔹 Step 3: Using describe()

describe() provides a statistical summary; by default it covers only the numeric columns.

In [41]:
print(df.describe())

             Salary  Experience
count      5.000000    5.000000
mean   56400.000000    4.200000
std     5176.871642    1.923538
min    50000.000000    2.000000
25%    52000.000000    3.000000
50%    58000.000000    4.000000
75%    60000.000000    5.000000
max    62000.000000    7.000000


🔹 Step 4: Groupby + Describe

We can even combine both!

In [42]:
# Group the DataFrame by the unique values in "Department", then describe the "Salary" column within each group.
dept_summary = df.groupby("Department")["Salary"].describe()
print(dept_summary)

            count     mean          std      min      25%      50%      75%  \
Department                                                                    
Finance       1.0  58000.0          NaN  58000.0  58000.0  58000.0  58000.0   
HR            2.0  51000.0  1414.213562  50000.0  50500.0  51000.0  51500.0   
IT            2.0  61000.0  1414.213562  60000.0  60500.0  61000.0  61500.0   

                max  
Department           
Finance     58000.0  
HR          52000.0  
IT          62000.0  


🔹 Summary

groupby() → Groups rows and applies an aggregation (mean, sum, count, etc.) to each group.

agg() → Allows multiple aggregations on different columns.

describe() → Gives descriptive statistics (count, mean, std, min, quartiles, max).

groupby() + describe() → Combines grouping and summary statistics.