## What is a DataFrame in Python?

A **DataFrame** is a **2-dimensional data structure** (like a table) used to store data in rows and columns.

In Python, it is provided mainly by the **pandas** library.

---

### 🔹 Simple Definition (Exam Friendly)

A **DataFrame** is a tabular data structure with labeled rows and columns, used to store and manipulate structured data.

---

### 🔹 Think Like This

It looks like an Excel sheet:

| Name | Age | Marks |
| --- | --- | --- |
| Pritam | 22 | 90 |
| Rahul | 21 | 85 |

Each:

- **Column** has a name (Name, Age, Marks)
- **Row** has an index (0, 1, 2…)
- Each column can hold a different data type (int, string, float, etc.)

---

### 🔹 Why Use a DataFrame?

- Store structured data
- Filter data
- Select rows/columns
- Perform calculations
- Analyze large datasets easily

Very useful in:

- Data Analysis
- Machine Learning
- Data Cleaning

---

### 🔹 Example in Python

```python
import pandas as pd

# Each key is a column name; each list holds that column's values
data = {
    "Name": ["Pritam", "Rahul"],
    "Age": [22, 21],
    "Marks": [90, 85],
}

df = pd.DataFrame(data)
print(df)
```

Here:

- `pd` is pandas
- `DataFrame()` creates the table
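
Once the table exists, its structure can be inspected through a few attributes. The snippet below is a minimal sketch that rebuilds the same sample table and prints its shape, column labels, row labels, and per-column data types.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Pritam", "Rahul"],
    "Age": [22, 21],
    "Marks": [90, 85],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(list(df.index))    # row labels; the default index counts from 0
print(df.dtypes)         # one data type per column
```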

---

### 🔹 Internally (Important Concept)

A DataFrame is:

- A collection of **Series**
- Each column is a **Series**
- Indexed both by rows and columns
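
The relationship between a DataFrame and its Series can be seen directly. This is a small sketch, reusing the sample data from above; the variable names `age`, `names`, `marks`, and `rebuilt` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Pritam", "Rahul"],
    "Age": [22, 21],
    "Marks": [90, 85],
})

# Selecting a single column gives back a Series
age = df["Age"]
print(type(age))

# The Series carries the same row index as the DataFrame
print(age.index.equals(df.index))

# Going the other way: named Series can be placed side by side as columns
names = pd.Series(["Pritam", "Rahul"], name="Name")
marks = pd.Series([90, 85], name="Marks")
rebuilt = pd.concat([names, marks], axis=1)
print(rebuilt)
```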

----

- Creating a DataFrame  
- Selection and Indexing of Columns  
- Creating a New Column  
- Removing a Column  
- Selecting Rows  
- Selecting Subsets of Rows and Columns  
- Conditional Selection  


In [8]:
import numpy as np
import pandas as pd

In [9]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'salary': [70000, 80000, 90000, 100000, 110000]
}

In [10]:
df = pd.DataFrame(data)
df

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 |
| 1 | Bob | 30 | Los Angeles | 80000 |
| 2 | Charlie | 35 | Chicago | 90000 |
| 3 | David | 40 | Houston | 100000 |
| 4 | Eve | 45 | Phoenix | 110000 |


In [11]:
data_list = [
    ['Alice', 25, 'New York', 70000],
    ['Bob', 30, 'Los Angeles', 80000],
    ['Charlie', 35, 'Chicago', 90000],
    ['David', 40, 'Houston', 100000],
    ['Eve', 45, 'Phoenix', 110000]
]
columns = ['Name', 'Age', 'City', 'Salary']
# when building from a list of rows, pass the column names explicitly;
# otherwise pandas labels the columns 0, 1, 2, ...
df_list = pd.DataFrame(data_list, columns=columns)
df_list

| | Name | Age | City | Salary |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 |
| 1 | Bob | 30 | Los Angeles | 80000 |
| 2 | Charlie | 35 | Chicago | 90000 |
| 3 | David | 40 | Houston | 100000 |
| 4 | Eve | 45 | Phoenix | 110000 |
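
Besides a dictionary of columns and a list of rows, a DataFrame can also be built from a list of dictionaries, where each dictionary is one row. The sketch below is an illustration with a shortened version of the sample data; `records`, `df_records`, and `df_labeled` are hypothetical names, and the `"a"`/`"b"` labels are made up to show the `index` parameter.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
]
df_records = pd.DataFrame(records)
print(df_records)

# Custom row labels can be supplied through the index parameter
df_labeled = pd.DataFrame(records, index=["a", "b"])
print(df_labeled)
```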


### Selection and Indexing of Columns

In [12]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      str  
 1   Age     5 non-null      int64
 2   City    5 non-null      str  
 3   salary  5 non-null      int64
dtypes: int64(2), str(2)
memory usage: 292.0 bytes


In [13]:
df["Name"]

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: str

In [None]:
df["Name", "Age"]
# ❌ raises a KeyError: pandas treats "Name", "Age" as one tuple key; to select multiple columns, pass a list instead of a tuple

In [15]:
df[["Name", "Age"]]

| | Name | Age |
| --- | --- | --- |
| 0 | Alice | 25 |
| 1 | Bob | 30 |
| 2 | Charlie | 35 |
| 3 | David | 40 |
| 4 | Eve | 45 |
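
The kind of object returned depends on what goes inside the brackets. The following sketch, on a shortened copy of the sample data, compares a single label, a one-element list, and a two-element list, and then catches the error from the tuple form.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

one_col = df["Name"]            # single label -> Series
one_col_frame = df[["Name"]]    # list with one label -> DataFrame
two_cols = df[["Name", "Age"]]  # list of labels -> DataFrame

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(type(two_cols).__name__)

# Without the inner list, the two labels form one tuple key that does not exist
try:
    df["Name", "Age"]
except KeyError as err:
    print("KeyError:", err)
```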


In [16]:
# add a new column; the list length must match the number of rows in the DataFrame
df['Designation'] = ['Engineer', 'Manager', 'Director', 'VP', 'CEO']
df

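| | Name | Age | City | salary | Designation |
| --- | --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 | Engineer |
| 1 | Bob | 30 | Los Angeles | 80000 | Manager |
| 2 | Charlie | 35 | Chicago | 90000 | Director |
| 3 | David | 40 | Houston | 100000 | VP |
| 4 | Eve | 45 | Phoenix | 110000 | CEO |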
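
A new column does not have to come from a hand-written list. The sketch below shows two other common ways to add one, plus what happens when the list length is wrong. The `Country` and `bonus` columns and the 10% rate are invented for illustration and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "salary": [70000, 80000, 90000],
})

# A single value is repeated for every row
df["Country"] = "USA"

# A column can be computed element-wise from an existing column
df["bonus"] = df["salary"] * 0.10

# A list with the wrong number of values is rejected
try:
    df["Designation"] = ["Engineer", "Manager"]  # 2 values for 3 rows
except ValueError as err:
    print("ValueError:", err)

print(df)
```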


In [None]:
df.drop('Designation')  # ❌ raises a KeyError: axis=0 is the default, so pandas looks for a row labeled 'Designation'; pass axis=1 to drop a column

In [19]:
df_modified = df.drop('Designation', axis=1)  # returns a new DataFrame without 'Designation'; the original df is left unchanged
df_modified

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 |
| 1 | Bob | 30 | Los Angeles | 80000 |
| 2 | Charlie | 35 | Chicago | 90000 |
| 3 | David | 40 | Houston | 100000 |
| 4 | Eve | 45 | Phoenix | 110000 |


In [20]:
df  # the original still has 'Designation': drop() defaults to inplace=False and returns a new DataFrame

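| | Name | Age | City | salary | Designation |
| --- | --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 | Engineer |
| 1 | Bob | 30 | Los Angeles | 80000 | Manager |
| 2 | Charlie | 35 | Chicago | 90000 | Director |
| 3 | David | 40 | Houston | 100000 | VP |
| 4 | Eve | 45 | Phoenix | 110000 | CEO |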


In [21]:
df.drop('Designation', axis=1, inplace=True) # drop the 'Designation' column from the DataFrame

In [22]:
df  # 'Designation' is gone: inplace=True modifies df itself, and drop() returns None

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 |
| 1 | Bob | 30 | Los Angeles | 80000 |
| 2 | Charlie | 35 | Chicago | 90000 |
| 3 | David | 40 | Houston | 100000 |
| 4 | Eve | 45 | Phoenix | 110000 |
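
`drop` also accepts a `columns=` keyword, which avoids having to remember the axis number. The sketch below, on a small made-up table, shows that form, the reassignment pattern that is often used in place of `inplace=True`, and the `None` return value of an in-place drop.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Designation": ["Engineer", "Manager"],
})

# columns= does the same job as passing the label together with axis=1
without_designation = df.drop(columns="Designation")
print(list(without_designation.columns))
print(list(df.columns))  # the original still has all three columns

# With inplace=True the object is changed and the call returns None
other = without_designation.copy()
result = other.drop(columns="Age", inplace=True)
print(result)
print(list(other.columns))

# Reassigning the result is a common alternative to inplace=True
df = df.drop(columns=["Designation"])
print(list(df.columns))
```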


In [None]:
df.drop(4, axis=0, inplace=True) # this will drop the row with index 4 from the DataFrame, and it will modify the original DataFrame in place

In [25]:
df

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 |
| 1 | Bob | 30 | Los Angeles | 80000 |
| 2 | Charlie | 35 | Chicago | 90000 |
| 3 | David | 40 | Houston | 100000 |
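
Dropping a row from the middle leaves a gap in the index labels, because pandas does not renumber the remaining rows. A minimal sketch of this, and of renumbering with `reset_index`, assuming a cut-down version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

# Drop the row labeled 1; the remaining labels keep their old values
trimmed = df.drop(index=1)
print(list(trimmed.index))

# drop=True discards the old labels instead of keeping them as a new column
renumbered = trimmed.reset_index(drop=True)
print(list(renumbered.index))
print(renumbered)
```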


----
## Selecting Rows

In [26]:
df.loc[0]  # label-based: returns the row whose index label is 0, as a Series

Name         Alice
Age             25
City      New York
salary       70000
Name: 0, dtype: object

In [27]:
df.loc[[0, 1]] # this will return the first and second rows of the DataFrame as a DataFrame

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 0 | Alice | 25 | New York | 70000 |
| 1 | Bob | 30 | Los Angeles | 80000 |


In [28]:
df.iloc[0]  # position-based: returns the row at integer position 0, as a Series (the same row here, because the labels are 0, 1, 2, ...)

Name         Alice
Age             25
City      New York
salary       70000
Name: 0, dtype: object
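
With the default index, `loc` and `iloc` return the same row, which hides the difference between them. The sketch below uses made-up index labels (10, 20, 30) so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical labels, chosen to differ from positions
)

print(df.loc[10, "Name"])     # row with label 10
print(df.iloc[0]["Name"])     # row at position 0 (the same row)

# Label 0 does not exist in this index, so loc fails where iloc succeeds
try:
    df.loc[0]
except KeyError as err:
    print("KeyError:", err)

# Slices differ too: loc includes the end label, iloc excludes the end position
print(len(df.loc[10:20]))
print(len(df.iloc[0:1]))
```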

----
## Selecting Subsets of Rows and Columns

In [32]:
df.loc[[0,1]][['City', 'salary']] # this will return the 'City' and 'salary' columns for the first and second rows of the DataFrame as a DataFrame

| | City | salary |
| --- | --- | --- |
| 0 | New York | 70000 |
| 1 | Los Angeles | 80000 |
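
The chained form above selects rows first and columns second, in two steps. `loc` and `iloc` also accept a row selector and a column selector together, which does the same selection in one call. A sketch on the four-row version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "salary": [70000, 80000, 90000, 100000],
})

# Two steps: rows, then columns
chained = df.loc[[0, 1]][["City", "salary"]]

# One step with labels: rows first, columns second
by_label = df.loc[[0, 1], ["City", "salary"]]

# One step with positions: rows 0-1, columns 2-3 (end positions are excluded)
by_position = df.iloc[0:2, 2:4]

print(by_label)
print(by_label.equals(chained))
```

The single-call form is generally preferred when assigning values, since assigning through a chained selection may not update the original DataFrame.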


----
## Conditional Selection

In [None]:
# select only the people older than 30

df[df['Age'] > 30] # keeps only the rows where 'Age' is greater than 30

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 2 | Charlie | 35 | Chicago | 90000 |
| 3 | David | 40 | Houston | 100000 |
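
Conditional selection works in two stages: the comparison first produces a Series of `True`/`False` values (a boolean mask), and the DataFrame then keeps the rows where the mask is `True`. The sketch below splits the two stages apart; `mask` is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

# Stage 1: the comparison is applied to every row and gives a boolean Series
mask = df["Age"] > 30
print(mask)

# Stage 2: indexing with the mask keeps only the rows marked True
print(df[mask])
```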


In [None]:
# select only the people who are older than 30 and live in 'Chicago'

df[(df['Age'] > 30) & (df['City'] == 'Chicago')] # each condition needs its own parentheses because & binds more tightly than > and ==

| | Name | Age | City | salary |
| --- | --- | --- | --- | --- |
| 2 | Charlie | 35 | Chicago | 90000 |
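
Conditions can be combined in other ways besides `&`. The sketch below, on the four-row sample data, shows `|` (or), `~` (not), `isin` for matching against several values, and `loc` for filtering rows while picking a column. The specific thresholds and city lists are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# OR: older than 35, or living in Chicago
older_or_chicago = df[(df["Age"] > 35) | (df["City"] == "Chicago")]

# NOT: everyone who does not live in Chicago
not_chicago = df[~(df["City"] == "Chicago")]

# isin: the city matches any value in the list
in_cities = df[df["City"].isin(["New York", "Houston"])]

# loc: filter the rows and pick one column in the same call
names_over_30 = df.loc[df["Age"] > 30, "Name"]

print(older_or_chicago)
print(not_chicago)
print(in_cities)
print(names_over_30)
```

Python's `and`, `or`, and `not` keywords do not work here, because they expect a single truth value rather than one value per row.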
