### **Introduction to Pandas**

- **Pandas** is a `Python library used for data manipulation and analysis`. It provides easy-to-use **data structures** like `Series (1D) and DataFrame (2D)` that help you work with structured data, such as tables or spreadsheets.

- Pandas is a fast, powerful, flexible and easy to use open source data analysis and **manipulation tool**,
built on top of the Python programming language.


* The name Pandas is derived from the term `Panel Data`, which refers to multi-dimensional structured datasets.
* With Pandas, we work with real-world data by loading it into DataFrames.
* It makes tasks like `cleaning, filtering, and analyzing data much easier`.

* NumPy = `Arrays`
* Pandas = `Series` and `DataFrame` (labeled structures built on top of arrays)
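
To make the NumPy/Pandas comparison above concrete, here is a minimal sketch (the variable names and values are illustrative, not from the original notes). It suggests how a Series holds the same values as an array while adding an index of labels.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])                # plain NumPy array: positions only
s = pd.Series(arr, index=["a", "b", "c"])   # same values, plus labels

print(arr[1])         # access by position
print(s["b"])         # access by label
print(s.to_numpy())   # the values as a NumPy array again
print(list(s.index))  # the labels attached to those values
```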

#### **Pandas Basic Commands**

---

- ! pip install pandas
- ! pip uninstall pandas
- ! pip install --upgrade pandas
- ! pip show pandas
- `pd.__version__  #(major_number.minor_number.revision_number)` (Big update, Small update, Bug Fix)

In [2]:
import pandas as pd

### Checking Pandas Version

In [2]:
pd.__version__

'2.2.3'
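
Since the version string follows the `major.minor.revision` pattern, it can also be checked in code. The snippet below is only a sketch; the guard on pandas 2.x is an arbitrary example threshold, not a requirement of these notes.

```python
import pandas as pd

# Split "major.minor.revision" and keep the first two parts as integers.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
print(major, minor)

# Example guard: run version-specific code only on pandas 2.x or newer.
if major >= 2:
    print("Running on pandas 2.x or newer")
```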

## Applications of Pandas

![Applications](https://miro.medium.com/v2/resize:fit:890/1*VmfrWgMp9DXZx5NplsfU5A.png)

## **Types of Data in Pandas**

### 1. **Categorical Data** 
   - Data that represents categories or groups.  
   - Example: `"Male"`, `"Female"`, `"Red"`, `"Blue"`, `"Apple"`, `"Orange"`.  
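   - Stored as the `category` dtype when requested (for example with `dtype="category"`); plain strings default to `object`.  
   - **Efficient for memory usage and performance when a column has few distinct values that repeat often.**  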

In [3]:
data = pd.Series(["Red", "Blue", "Red", "Green"], dtype="category")
print(data)

0      Red
1     Blue
2      Red
3    Green
dtype: category
Categories (3, object): ['Blue', 'Green', 'Red']
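
The memory claim for categorical data can be checked with a rough experiment. The sketch below uses made-up data (three color names repeated many times) and compares the reported memory of the same values stored as plain strings and as `category`; exact byte counts depend on the Pandas version, so none are shown.

```python
import pandas as pd

# 30,000 strings but only three distinct values
colors = pd.Series(["Red", "Blue", "Green"] * 10_000)
as_category = colors.astype("category")

# deep=True asks pandas to count the string contents as well
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(colors.dtype, object_bytes)
print(as_category.dtype, category_bytes)
print(category_bytes < object_bytes)  # category should be the smaller one
```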


### 2. **Object Data (String/Text Data)** 
   - Used for **text or mixed-type data** (strings, alphanumeric values).  
   - Example: `"John"`, `"Apple123"`, `"Hello World"`.  
   - Stored as `object` dtype in Pandas.  


In [4]:
data = pd.Series(["Apple", "Banana", "Cherry"])
print(data.dtype)  # Output: object

object


### 3. **Numerical Data**  
   - **Integer (`int64`)**: Whole numbers like `1, 100, -50, 2000`.  
   - **Floating-Point (`float64`)**: Decimal numbers like `3.14, 2.718, -0.5`.  
   - Example: Age, Salary, Temperature, Price. 

In [5]:
data = pd.Series([10, 20.5, 30, 40.7])
print(data.dtype)  # Output: float64

float64
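
The example above yields `float64` because a Series has a single dtype, so integers are upcast when they share a Series with decimals. The following sketch (values chosen only for illustration) shows the same effect, and that a missing value also pushes integers to `float64` by default.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])         # only whole numbers
mixed = pd.Series([1, 2.5, 3])      # one decimal upcasts everything
with_gap = pd.Series([1, None, 3])  # None is stored as NaN

print(ints.dtype)
print(mixed.dtype)     # float64
print(with_gap.dtype)  # float64
```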


## The help() Function

In [None]:
help(pd.Series)

### Show Documentation in the Notebook (the `?` syntax works in IPython/Jupyter only)

In [None]:
pd.Series?

### Using dir() to List Available Methods

In [None]:
dir(pd.Series)

## **1. Data Creation Functions**

---
## **Series**
- A **Series** in Pandas is a `one-dimensional array-like object` that can hold any data type (integers, strings, floats, etc.).
- A Pandas Series is like a `column in a table`.
- A **Series** has an **index** (labels) and **values** (data).
- It can be created from lists, arrays, or dictionaries.
- It allows you to perform operations like filtering, mapping, and aggregating data.
- **`pandas.Series(data=None, index=None, dtype=None, name=None, copy=None)`**
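
As a quick illustration of the filtering, mapping, and aggregating mentioned above, here is a small sketch. The `prices` Series and its labels are invented for the example.

```python
import pandas as pd

prices = pd.Series([120, 80, 200], index=["pen", "ink", "book"])

expensive = prices[prices > 100]        # filtering: keep values above 100
doubled = prices.map(lambda p: p * 2)   # mapping: transform each value
total = prices.sum()                    # aggregating: reduce to one number

print(expensive)
print(doubled)
print(total)
```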


### 1. Creating Series using list/tuple

In [9]:
data = ["Pavan","Kumar","Ande"]
t=(10,20,30,40)
s = pd.Series(data) # list
s1=pd.Series(t) # tuple
print(s)
print(s1)

0    Pavan
1    Kumar
2     Ande
dtype: object
0    10
1    20
2    30
3    40
dtype: int64


**Note** : **`List/Array → Index is assigned by Pandas (0,1,2,... by default)`**

### Label
- In **Pandas**, a **label** is the `identifier` assigned to each row (through the **index**) or each **column** in a data structure like a `Series` or `DataFrame`.
- Labels help to reference data elements efficiently, making data manipulation, access, and analysis much more intuitive.

In [10]:
labels = ['FirstName','LastName','SurName']
s = pd.Series(data, index=labels)  # pd.Series(data, labels) also works; the variable names are arbitrary
print(s)

FirstName    Pavan
LastName     Kumar
SurName       Ande
dtype: object


### 2. Using a NumPy Array

In [11]:
import numpy as np
arr = np.array([1,2,3,4])
s = pd.Series(arr)
print(s)

0    1
1    2
2    3
3    4
dtype: int64


### 3. Creating Series using Dictionary 

In [12]:
d = {'a':10,'b':20,'c':30,'d':40}
s = pd.Series(d)
print(s)

a    10
b    20
c    30
d    40
dtype: int64


**Note**: When creating a Pandas Series from a dictionary, **`you don't need to explicitly define the index`** because Pandas **`automatically uses the dictionary keys as the index`**.

In [13]:
labels = ['i1','i2','i3','i4']
s = pd.Series(d,labels)
print(s)

i1   NaN
i2   NaN
i3   NaN
i4   NaN
dtype: float64


**Note:** 
- If you **explicitly specify an index** while creating a Series from a **dictionary**, Pandas **`rearranges the values according to the new index`**.
-  If a `label is present` in the dictionary, its **`corresponding value is used`**; if **not**, Pandas fills it with **`NaN`** (Not a Number, representing a missing value).
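
The cell above shows only the all-`NaN` case. A sketch with a partly matching index (the labels `"c"`, `"a"`, `"z"` are chosen purely for illustration) may make the rule clearer: matching keys keep their values in the requested order, unmatched labels become `NaN`, and dictionary keys that are not requested are dropped.

```python
import pandas as pd

d = {"a": 10, "b": 20, "c": 30}

# "c" and "a" exist in the dictionary, "z" does not, and "b" is not requested
s = pd.Series(d, index=["c", "a", "z"])
print(s)
```

Because `NaN` is a floating-point value, the resulting Series is `float64` even though the dictionary held integers.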


### **Data selection in Series (Indexing)**

In [14]:
s = pd.Series([10, 20, 30, 40])
print(s)

0    10
1    20
2    30
3    40
dtype: int64


#### **1. Default Indexing (Integer-Based)**

In [15]:
print(s[0]) 
print(s[2])

10
30


#### **2. Custom Indexing (Labels)**

In [16]:
s = pd.Series([100, 200, 300], index=["a", "b", "c"])
print(s)

a    100
b    200
c    300
dtype: int64


In [17]:
print(s["a"])
print(s["c"])

100
300
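
Plain square brackets can be ambiguous once a Series has custom labels. A hedged sketch of the explicit accessors follows: `.loc` always looks up by label and `.iloc` always by position. To my understanding, recent Pandas 2.x releases warn when an integer position is passed to plain `[]` on a Series with a non-integer index, so the explicit forms are the safer habit.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=["a", "b", "c"])

print(s.loc["b"])   # label-based lookup
print(s.iloc[1])    # position-based lookup (second element)
```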


#### **3. Boolean Indexing**

In [18]:
s = pd.Series([10, 20, 30, 40])
print(s[s > 20])  

2    30
3    40
dtype: int64
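
Several conditions can be combined in one boolean mask. The sketch below uses `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

middle = s[(s > 10) & (s < 40)]    # both conditions must hold
edges = s[(s == 10) | (s == 40)]   # either condition may hold
not_small = s[~(s < 30)]           # negate a condition

print(middle)
print(edges)
print(not_small)
```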


#### **4. Slicing**

In [19]:
print(s[1:3])  

1    20
2    30
dtype: int64


In [20]:
s = pd.Series([100, 200, 300, 400], index=["a", "b", "c", "d"])
print(s["b":"d"])

b    200
c    300
d    400
dtype: int64
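
The two slicing cells above differ in one detail that is easy to miss: a positional slice excludes its stop position, while a label slice includes its stop label. A minimal sketch using the explicit accessors:

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=["a", "b", "c", "d"])

by_position = s.iloc[1:3]   # positions 1 and 2; position 3 is excluded
by_label = s.loc["b":"d"]   # labels "b", "c" and "d"; the stop label is included

print(by_position)
print(by_label)
```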


---
## **DataFrame**
- A **DataFrame** is a **`2D table-like structure`** in Pandas, similar to an **Excel sheet** or **SQL table**.
- It consists of **rows** and **columns**, where **`each column is a Series`**.
- Each **column** can have a `different data type (int, float, string, etc.)`.
- Can be created using `lists, dictionaries, NumPy arrays, CSV, SQL, etc`.
- **`pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)`**

  

**Note:** When the columns are given as lists or arrays, they must all have the **`same number of elements`**; otherwise Pandas raises a `ValueError`. (Series are aligned on their index instead, and gaps are filled with `NaN`.)
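
The length rule can be seen with a short sketch. The data is made up, and the exact wording of the error message may vary between Pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

raised = False
try:
    # "Name" has three values but "Age" has only two
    pd.DataFrame({"Name": ["Pavan", "Hema", "Kumar"], "Age": [23, 22]})
except ValueError as err:
    raised = True
    print("ValueError:", err)
```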

### **Creating a DataFrame**

#### **1. Using a Dictionary**

In [4]:
data = {
    "Name" : ["Pavan","Hema"],
    "Age" : [23,22],
    "Salary" : [23000,28000]
}
df = pd.DataFrame(data)
df

    Name  Age  Salary
0  Pavan   23   23000
1   Hema   22   28000


#### **2. Using a List of Lists**

In [35]:
df = pd.DataFrame([
    ["Alice", 25, 50000],
    ["Bob", 30, 60000],
    ["Charlie", 35, 70000]
], columns=["Name", "Age", "Salary"])
df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000


#### **3. Using Numpy array**

In [36]:
arr = np.arange(1,10).reshape(3,3)
rows = ['row-1','row-2','row-3']
cols =['col-1','col-2','col-3']
df = pd.DataFrame(arr,index=rows,columns=cols)
df

       col-1  col-2  col-3
row-1      1      2      3
row-2      4      5      6
row-3      7      8      9


#### **4. Creating a DataFrame from a Series**
If you have multiple Pandas Series, you can combine them into a DataFrame.

In [5]:
name = pd.Series(["Pavan", "Hema", "Kumar"])
age = pd.Series([23, 22, 25])
salary = pd.Series([23000, 28000, 30000])

df = pd.DataFrame({"Name": name, "Age": age, "Salary": salary})
df

    Name  Age  Salary
0  Pavan   23   23000
1   Hema   22   28000
2  Kumar   25   30000
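
When Series are combined, Pandas matches them by index label rather than by position. The sketch below uses hypothetical labels `e1` to `e3` and leaves one entry out of `age` to show that the gap is filled with `NaN` instead of raising an error.

```python
import pandas as pd

name = pd.Series(["Pavan", "Hema", "Kumar"], index=["e1", "e2", "e3"])
age = pd.Series([23, 25], index=["e1", "e3"])  # no entry for "e2"

df = pd.DataFrame({"Name": name, "Age": age})
print(df)
```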


#### **5. Using List of Dictionaries**
(Missing values are filled with NaN)

In [9]:
data = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30, "City": "Paris"},
    {"Name": "xyz", "Age": 23, "City": "London"},
    {"Name": "abc", "Age": 22, "City": "USA"}
]
df = pd.DataFrame(data,index=['A1','A2','A3','A4'])
df

     Name  Age    City
A1  Alice   25     NaN
A2    Bob   30   Paris
A3    xyz   23  London
A4    abc   22     USA


In [50]:
type(df)

pandas.core.frame.DataFrame

In [51]:
type(df['Name'])

pandas.core.series.Series

---
### **Accessing Rows and Columns in DataFrame**

#### **1. Accessing Columns**

- The column name as a key: **`df['column_name']`**
- **Dot notation**: **`df.column_name`** (only if the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)

In [10]:
df['Name']

A1    Alice
A2      Bob
A3      xyz
A4      abc
Name: Name, dtype: object

In [60]:
df.Age

A1    25
A2    30
A3    23
A4    22
Name: Age, dtype: int64

---
### **Q: Why is `df['col']` preferred over `df.col` when accessing columns in Pandas?**  

#### **Answer:**  
Using `df['col']` is better than `df.col` because:  

1. **Supports all column names** – Works even if the column name has spaces or special characters (`df['Employee Name']` ✅, but `df.Employee Name` ❌).  
2. **Avoids conflicts** – Some column names may clash with built-in Pandas attributes (`df.count` refers to a method, not a column).  
3. **Allows dynamic access** – Works with variables (`df[col_name]` ✅, but `df.col_name` ❌).  
4. **Needed to create columns** – `df['new_col'] = values` adds a column, whereas `df.new_col = values` only sets a Python attribute on the object and no column is created.  

👉 **Best Practice:** Always use `df['col']` for consistent and error-free Pandas operations. 🚀
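
The points above can be tried out with a small sketch. The DataFrame and its column names (`"Employee Name"`, `"count"`) are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"Employee Name": ["Alice", "Bob"], "count": [3, 5]})

print(df["Employee Name"].tolist())  # works despite the space in the name
print(callable(df.count))            # df.count is the DataFrame method, not the column
print(df["count"].tolist())          # bracket access returns the column

col_name = "count"                   # column chosen at runtime
print(df[col_name].sum())
```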

---

#### **2. Accessing Rows**
- **`iloc[]`** for positional indexing (integer-based)
- **`loc[]`** for label-based indexing

In [13]:
# Access the first row (index 0)
df.iloc[0]

Name    Alice
Age        25
City      NaN
Name: A1, dtype: object

In [12]:
# Access the row whose index label is 'A2'
df.loc['A2']

Name      Bob
Age        30
City    Paris
Name: A2, dtype: object

#### **3. Accessing Specific Value**
- Using loc[] for label-based indexing
- Using iloc[] for positional indexing
- **`df.iloc[row_index, column_index]`** or **`df.loc[row_label, column_name]`**

In [16]:
# Access the 'Age' of the row whose index label is 'A1'
print(df.loc['A1', 'Age'])


25


In [18]:
# Access the value at the 2nd row (position 1) and 2nd column (position 1) - 'Age'
print(df.iloc[1, 1])


30
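
For a single value, Pandas also offers the scalar accessors `at` (label-based) and `iat` (position-based). They are not used elsewhere in these notes, so treat the following as an optional sketch on a small stand-in DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30]},
    index=["A1", "A2"],
)

print(df.at["A1", "Age"])  # label-based scalar access
print(df.iat[1, 1])        # position-based scalar access
```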


#### **4. Accessing Multiple Rows and Columns Using Slicing**

In [58]:
# Access rows 'A2' through 'A4' (inclusive), columns 'Name' and 'City'
print(df.loc['A2':'A4', ['Name', 'City']])


   Name    City
A2  Bob   Paris
A3  xyz  London
A4  abc     USA


In [19]:
# Access the first two rows and first two columns
print(df.iloc[0:2, 0:2])

     Name  Age
A1  Alice   25
A2    Bob   30
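
The two cells above return different numbers of rows because `loc` slices include the stop label while `iloc` slices exclude the stop position. The sketch below rebuilds a DataFrame like the one used in this section and selects the same block both ways.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "xyz", "abc"],
        "Age": [25, 30, 23, 22],
        "City": [None, "Paris", "London", "USA"],
    },
    index=["A1", "A2", "A3", "A4"],
)

by_label = df.loc["A1":"A2", "Name":"Age"]   # stop labels are included
by_position = df.iloc[0:2, 0:2]              # stop positions are excluded

print(by_label)
print(by_label.equals(by_position))
```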
