### Module 1: Introduction to Pandas

#### 1. **What is Pandas?**

Pandas is a powerful and flexible open-source data analysis and manipulation library for Python. It provides data structures and functions needed to efficiently manipulate large datasets, perform data cleaning, and conduct data analysis. The core data structures in Pandas are the Series and DataFrame.

- **Key Features:**
  - **Data Manipulation:** Provides tools for reshaping and pivoting datasets.
  - **Data Cleaning:** Includes functions for handling missing data, duplicates, and data transformation.
  - **Data Aggregation:** Facilitates grouping, summarizing, and aggregating data.
  - **Integration:** Works seamlessly with other data science libraries like NumPy, SciPy, and Matplotlib.

- **Use Cases:**
  - Data cleaning and preparation
  - Exploratory data analysis (EDA)
  - Time series analysis
  - Data visualization

#### 2. **Installation and Setup**

Pandas is not part of the Python standard library, so you need to install it first, for example with Python's package manager, `pip`, from the command line or terminal.

- **Installation Command:**
  ```bash
  pip install pandas
  ```

- **Verification:**
  After installation, you can verify it by importing Pandas and checking its version:

  ```python
  import pandas as pd
  print(pd.__version__)
  ```

#### 3. **Pandas Data Structures**

Pandas provides two primary data structures: **Series** and **DataFrame**. These structures are fundamental for data analysis and manipulation in Pandas.

#### 3.1. **Series**

- **Description:**
  A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). It is akin to a single column of a DataFrame, or a list whose elements carry labels, and it adds label-based access and alignment that a basic list or NumPy array does not provide.

- **Creation:**
  You can create a Series from different data structures like lists, dictionaries, or scalar values.

  ```python
  import pandas as pd

  # Creating a Series from a list
  data = [10, 20, 30, 40]
  series = pd.Series(data)
  print(series)
  # Output:
  # 0    10
  # 1    20
  # 2    30
  # 3    40
  # dtype: int64
  ```
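
The list example covers only one of the input types mentioned above. As a minimal sketch, the following shows the dictionary and scalar cases; the names `scores` and `defaults` and their values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# From a dictionary: keys become the index labels, values become the data
scores = pd.Series({'math': 90, 'physics': 85, 'chemistry': 78})
print(scores)

# From a scalar: the value is repeated once for every label in the index
defaults = pd.Series(5, index=['a', 'b', 'c'])
print(defaults)
```

When a scalar is used, the `index` argument determines the length of the resulting Series.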


##### Creating a Series with Custom Index

- Custom indices label the data points, which makes the data easier to read and to select by name.

```python
import pandas as pd
index = ['edukron', 'aiml', 'ml', 'dl']
data = [100, 200, 300, 400]
series = pd.Series(data, index=index)
print(series)
```


##### Accessing Elements

- You can access elements by integer position with `.iloc` or by label with `[]` (or `.loc`). Plain `series[1]` on a label-indexed Series relies on a positional fallback that newer Pandas versions deprecate, so `.iloc` is the safer choice.

```python
# Accessing an element by integer position
print(series.iloc[1])  # Output: 200

# Accessing an element by label
print(series['ml'])  # Output: 300
```

##### Slicing

- You can slice a Series to access a range of elements. Position-based slices exclude the end position, while label-based slices include the end label.

```python
# Slicing by integer position (end position excluded)
print(series.iloc[1:3])
# Output:
# aiml    200
# ml      300
# dtype: int64

# Slicing by label (end label included)
print(series.loc['aiml':'dl'])
# Output:
# aiml    200
# ml      300
# dl      400
# dtype: int64
```

##### Operations on Series

- You can perform various operations on a Series, such as arithmetic operations, statistical computations, and applying functions.

```python
# Arithmetic operation (applied element-wise)
print(series + 10)
# Output:
# edukron    110
# aiml       210
# ml         310
# dl         410
# dtype: int64

# Statistical operations
print(series.mean())  # Output: 250.0
print(series.std())   # Output: approximately 129.10 (sample standard deviation)

# Applying a function to each element
print(series.apply(lambda x: x ** 2))
# Output:
# edukron     10000
# aiml        40000
# ml          90000
# dl         160000
# dtype: int64
```


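##### Handling Missing Data

- A Series represents missing data as `NaN`. Because `NaN` is a floating-point value, integer data that contains a missing entry is converted to `float64`.

```python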
data_with_nan = [1, 2, None, 4]
series_with_nan = pd.Series(data_with_nan)
print(series_with_nan)
# Output:
# 0    1.0
# 1    2.0
# 2    NaN
# 3    4.0
# dtype: float64

# Checking for missing values
print(series_with_nan.isna())
# Output:
# 0    False
# 1    False
# 2     True
# 3    False
# dtype: bool

# Filling missing values
filled_series = series_with_nan.fillna(0)
print(filled_series)
# Output:
# 0    1.0
# 1    2.0
# 2    0.0
# 3    4.0
# dtype: float64
```
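
The example above shows integers turning into floats as soon as a value is missing. As a small sketch of two alternatives, the block below requests the nullable `"Int64"` dtype so the integers are preserved, and uses `dropna()` to discard the gap instead of filling it. The variable names `as_float` and `as_int` are hypothetical.

```python
import pandas as pd

# Default behaviour: a missing value forces the data to float64
as_float = pd.Series([1, 2, None, 4])
print(as_float.dtype)  # float64

# The nullable "Int64" dtype keeps integers and marks the gap as <NA>
as_int = pd.Series([1, 2, None, 4], dtype="Int64")
print(as_int.dtype)  # Int64

# dropna() removes missing entries instead of filling them
print(as_int.dropna().tolist())  # [1, 2, 4]
```

Whether to fill, drop, or keep missing entries depends on the analysis. Filling with 0, for instance, changes the mean of the data.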


##### String Operations

- A Series of strings supports vectorized string operations through the `.str` accessor.

```python
text_data = ['edukron', 'aiml', 'ml', 'dl']
text_series = pd.Series(text_data)

# Converting to uppercase
print(text_series.str.upper())
# Output:
# 0    EDUKRON
# 1       AIML
# 2         ML
# 3         DL
# dtype: object

# Finding the length of each string
print(text_series.str.len())
# Output:
# 0    7
# 1    4
# 2    2
# 3    2
# dtype: int64
```

#### 3.2. **DataFrame**

- **Description:**
  A DataFrame in Pandas is a two-dimensional labeled data structure that can hold data of different types (e.g., integers, floats, strings) across columns. It is akin to a table in a database or a spreadsheet, offering a powerful tool for data manipulation and analysis.

  **Key Characteristics:**
  - **Columns:** Each column in a DataFrame can be of a different data type.
  - **Indexing:** DataFrames have both row and column labels (indices), allowing for more flexible data access and manipulation.
  - **Alignment:** Automatic alignment of data based on row and column labels ensures consistency when performing operations on multiple DataFrames.

- **Creating a DataFrame:**
  DataFrames can be created from various data structures such as dictionaries, lists, and even other DataFrames.

  **From a Dictionary:**
  ```python
  import pandas as pd

  # Creating a DataFrame from a dictionary
  data = {
      'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'City': ['New York', 'Los Angeles', 'Chicago']
  }
  df = pd.DataFrame(data)
  print(df)
  # Output:
  #       Name  Age         City
  # 0    Alice   25     New York
  # 1      Bob   30  Los Angeles
  # 2  Charlie   35      Chicago
  ```

**From a List of Dictionaries:**

```python
# Creating a DataFrame from a list of dictionaries
data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df_from_list = pd.DataFrame(data_list)
print(df_from_list)
# Output:
#       Name  Age         City
# 0    Alice   25     New York
# 1      Bob   30  Los Angeles
# 2  Charlie   35      Chicago
```

**From a List of Lists:**

```python
# Creating a DataFrame from a list of lists with custom column names
data_list_of_lists = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df_from_lists = pd.DataFrame(data_list_of_lists, columns=['Name', 'Age', 'City'])
print(df_from_lists)
# Output:
#       Name  Age         City
# 0    Alice   25     New York
# 1      Bob   30  Los Angeles
# 2  Charlie   35      Chicago
```

**With a Custom Index:**

```python
# Reuse the `data` dictionary from the first example and supply custom row labels
index = ['person1', 'person2', 'person3']
df_custom_index = pd.DataFrame(data, index=index)
print(df_custom_index)
# Output:
#             Name  Age         City
# person1    Alice   25     New York
# person2      Bob   30  Los Angeles
# person3  Charlie   35      Chicago
```
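
The automatic alignment listed under Key Characteristics is easiest to see with a small example. The sketch below uses two Series with partly overlapping labels; the names `q1` and `q2` and their values are hypothetical. The same label-matching rule applies to DataFrames, on both rows and columns.

```python
import pandas as pd

# Two Series with partly overlapping index labels
q1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
q2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Addition matches values by label, not by position
total = q1 + q2
print(total)
# Output:
# a     NaN
# b    21.0
# c    32.0
# d     NaN
# dtype: float64

# Series.add with fill_value treats a label missing on one side as 0
filled_total = q1.add(q2, fill_value=0)
print(filled_total)
```

Labels present in only one operand produce `NaN`, which is why the result is `float64` even though both inputs hold integers.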



##### Accessing Data

- You can access data in a DataFrame by label with `.loc` or by integer position with `.iloc`. Boolean conditions allow more selective access.

###### Accessing Columns
```python
# Accessing a column
print(df['Name'])
# Output:
# 0      Alice
# 1        Bob
# 2    Charlie
# Name: Name, dtype: object
```

###### Accessing Rows by Label
```python
# Accessing a row by label
print(df_custom_index.loc['person1'])
# Output:
# Name       Alice
# Age           25
# City    New York
# Name: person1, dtype: object
```

###### Accessing Rows by Integer Position

```python
# Accessing a row by integer position
print(df.iloc[0])
# Output:
# Name       Alice
# Age           25
# City    New York
# Name: 0, dtype: object
```

###### Accessing Multiple Rows and Columns

```python
# Accessing multiple rows and columns
print(df_custom_index.loc[['person1', 'person2'], ['Name', 'City']])
# Output:
#           Name         City
# person1  Alice     New York
# person2    Bob  Los Angeles
```
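
Beyond labels and positions, rows are often selected by a condition on their values. The following is a minimal sketch of boolean filtering. It rebuilds the sample table under the hypothetical name `people` so that it runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# A comparison on a column yields a boolean Series (a mask)
mask = people['Age'] > 28

# .loc keeps the rows where the mask is True and, here, only two columns
older = people.loc[mask, ['Name', 'Age']]
print(older)
# Output:
#       Name  Age
# 1      Bob   30
# 2  Charlie   35

# Conditions combine with & (and) and | (or); each needs its own parentheses
chicago_older = people[(people['Age'] > 28) & (people['City'] == 'Chicago')]
print(chicago_older)
```

The original row labels (1 and 2) are preserved in the result, so a later `reset_index(drop=True)` may be useful if consecutive labels are needed.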


##### Adding and Removing Columns

- Assigning to a new column name adds a column, and `drop` returns a copy of the DataFrame without the named columns.

###### Adding a New Column
```python
# Adding a new column
df['Occupation'] = ['Engineer', 'Artist', 'Doctor']
print(df)
# Output:
#       Name  Age         City  Occupation
# 0    Alice   25     New York    Engineer
# 1      Bob   30  Los Angeles      Artist
# 2  Charlie   35      Chicago      Doctor
```

###### Removing a Column
```python
# Removing a column
df = df.drop(columns=['Occupation'])
print(df)
# Output:
#       Name  Age         City
# 0    Alice   25     New York
# 1      Bob   30  Los Angeles
# 2  Charlie   35      Chicago
```

##### DataFrame Operations

- DataFrames support a wide range of operations including aggregation, merging, and reshaping.

###### Aggregation
```python
# Aggregation - Descriptive statistics
print(df.describe())
# Output:
#         Age
# count   3.0
# mean   30.0
# std     5.0
# min    25.0
# 25%    27.5
# 50%    30.0
# 75%    32.5
# max    35.0
```
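
`describe()` summarizes a whole column at once. Grouping, listed earlier among the key features, computes such summaries per category. The sketch below uses a hypothetical `staff` table with invented values to show the idea.

```python
import pandas as pd

# Hypothetical sample data: several people per city
staff = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Age': [25, 35, 31, 45]
})

# Split the rows by City, then compute the mean Age within each group
mean_age = staff.groupby('City')['Age'].mean()
print(mean_age)
# Output:
# City
# Chicago     40.0
# New York    28.0
# Name: Age, dtype: float64

# agg() computes several summaries per group in one call
age_stats = staff.groupby('City')['Age'].agg(['min', 'max', 'count'])
print(age_stats)
```

By default the group keys become the index of the result and are sorted, which is why Chicago appears before New York.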


###### Merging DataFrames

```python
# Merging DataFrames on a common column
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Country': ['USA', 'USA']
})
merged_df = pd.merge(df, df2, on='Name', how='inner')
print(merged_df)
# Output:
#     Name  Age         City  Country
# 0  Alice   25     New York      USA
# 1    Bob   30  Los Angeles      USA
```
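
The inner join above silently drops Charlie, who has no match in the second table. As a sketch of the alternative, `how='left'` keeps every row of the first table, and `indicator=True` records where each row came from. The names `left` and `right` are hypothetical stand-ins for the two tables.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
right = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Country': ['USA', 'USA']})

# how='left' keeps every row of `left`; unmatched rows get NaN in the new columns.
# indicator=True adds a `_merge` column with 'both', 'left_only', or 'right_only'.
left_merged = pd.merge(left, right, on='Name', how='left', indicator=True)
print(left_merged)
```

Checking the `_merge` column is a quick way to find rows that failed to match before deciding which join type fits the task.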


###### Concatenation

```python
# Concatenating DataFrames along rows
df3 = pd.DataFrame({
    'Name': ['David', 'Eve'],
    'Age': [40, 45],
    'City': ['Miami', 'San Francisco']
})
concatenated_df = pd.concat([df, df3], ignore_index=True)
print(concatenated_df)
# Output:
#       Name  Age           City
# 0    Alice   25       New York
# 1      Bob   30    Los Angeles
# 2  Charlie   35        Chicago
# 3    David   40          Miami
# 4      Eve   45  San Francisco
```

###### Reshaping
```python
# Reshaping DataFrame - Pivot
pivot_df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'Category': ['A', 'A', 'B', 'B'],
    'Value': [10, 20, 30, 40]
})
pivot_table = pivot_df.pivot(index='Date', columns='Category', values='Value')
print(pivot_table)
# Output:
# Category     A   B
# Date
# 2023-01-01  10  30
# 2023-01-02  20  40
```
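
`pivot` works here because each Date and Category pair occurs exactly once. The sketch below, built on a hypothetical `sales` table with invented values, shows what happens when a pair repeats and how `pivot_table` handles it by aggregating.

```python
import pandas as pd

# Hypothetical data in which ('2023-01-01', 'A') appears twice
sales = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'Category': ['A', 'A', 'B'],
    'Value': [10, 5, 40]
})

# pivot() cannot place two values in one cell, so it raises ValueError
try:
    sales.pivot(index='Date', columns='Category', values='Value')
except ValueError as err:
    print('pivot failed:', err)

# pivot_table() aggregates the duplicates instead (here by summing them)
summary = sales.pivot_table(index='Date', columns='Category',
                            values='Value', aggfunc='sum')
print(summary)
```

Combinations that never occur in the data, such as category B on 2023-01-01, appear as `NaN` in the result.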



##### Handling Missing Data

- Missing data is common in real-world datasets. Pandas provides robust methods to handle missing values.

###### Checking for Missing Values
```python
# Checking for missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
    'City': ['New York', 'Los Angeles', None]
}
df_with_nan = pd.DataFrame(data_with_nan)
print(df_with_nan.isna())
# Output:
#     Name    Age   City
# 0  False  False  False
# 1  False   True  False
# 2   True  False   True
```
