
# Pandas from Beginner to Advanced
This notebook will guide you through key features of pandas, from basic data manipulation to advanced operations.

In [1]:
# Import necessary libraries
import seaborn as sns
import pandas as pd

# Load the iris dataset from seaborn
df = sns.load_dataset('iris')

# Display the first few rows of the dataset
df.head()


,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [2]:
df

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## 1. Beginner Level

### 1.1 Introduction to Pandas
Pandas is a powerful library for data manipulation and analysis, especially with structured data like tables and CSV files.

In [None]:
# Importing Pandas
import pandas as pd

# Creating a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston


### 1.2 Loading and Viewing Data
Pandas provides functions such as `read_csv()` for loading data from files, and methods such as `head()` and `info()` for inspecting the result.

In [None]:
# Loading data from a CSV file (Replace with actual file path)
# df = pd.read_csv('path_to_file.csv')

# Viewing first few rows
print(df.head())

# Getting DataFrame info (info() prints its report itself and returns None,
# so it should not be wrapped in print())
df.info()

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
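
As a minimal sketch of the commented-out `read_csv()` call, the block below reads CSV text from an in-memory buffer rather than a file, so it runs without anything on disk. The CSV content and the names `csv_text` and `people` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file on disk
csv_text = """Name,Age,City
Alice,24,New York
Bob,27,Los Angeles
Charlie,22,Chicago
David,32,Houston
"""

# read_csv() accepts a file path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people.head(2))   # first two rows only
print(people.shape)     # (number of rows, number of columns)
print(people.dtypes)    # data type inferred for each column
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.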


## 2. Intermediate Level

### 2.1 Data Selection and Filtering
Selecting columns and filtering rows based on conditions is essential for working with pandas DataFrames.

In [None]:
# Selecting a specific column
ages = df['Age']
print(ages)

# Filtering rows based on a condition
filtered_df = df[df['Age'] > 25]
print(filtered_df)

0    24
1    27
2    22
3    32
Name: Age, dtype: int64
    Name  Age         City
1    Bob   27  Los Angeles
3  David   32      Houston
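
Beyond a single condition, filters can be combined. The sketch below rebuilds the same small table under the hypothetical name `people` and shows three common patterns; the variable names are assumptions chosen for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
over_25_not_houston = people[(people['Age'] > 25) & (people['City'] != 'Houston')]
print(over_25_not_houston)

# isin() keeps rows whose value appears in the given list
in_selected_cities = people[people['City'].isin(['Chicago', 'Houston'])]
print(in_selected_cities)

# .loc filters rows and selects columns in a single step
names_over_25 = people.loc[people['Age'] > 25, ['Name', 'Age']]
print(names_over_25)
```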


### 2.2 Basic Data Cleaning
Data cleaning, such as handling missing data or duplicates, is crucial when preparing datasets for analysis.

In [None]:
# Handling missing data
df_with_nan = df.copy()
df_with_nan.loc[2, 'Age'] = None
print('DataFrame with NaN:\n', df_with_nan)

# Filling missing data
df_filled = df_with_nan.fillna(0)
print('Filled DataFrame:\n', df_filled)

# Removing duplicates
df_no_duplicates = df.drop_duplicates()
print('DataFrame without duplicates:\n', df_no_duplicates)

DataFrame with NaN:
       Name   Age         City
0    Alice  24.0     New York
1      Bob  27.0  Los Angeles
2  Charlie   NaN      Chicago
3    David  32.0      Houston
Filled DataFrame:
       Name   Age         City
0    Alice  24.0     New York
1      Bob  27.0  Los Angeles
2  Charlie   0.0      Chicago
3    David  32.0      Houston
DataFrame without duplicates:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
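
The DataFrame above contains no duplicate rows, so `drop_duplicates()` leaves it unchanged. The sketch below uses a hypothetical table, `raw`, that does contain a duplicate and a missing value, and shows two alternatives to filling with 0: filling with the column mean, or dropping the incomplete rows.

```python
import pandas as pd

# Hypothetical data with one duplicated row and one missing age
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [24.0, 27.0, 27.0, None],
})

# Remove the repeated 'Bob' row
deduplicated = raw.drop_duplicates()

# mean() skips missing values by default
mean_age = deduplicated['Age'].mean()

# Option 1: fill the missing age with the column mean
filled = deduplicated.fillna({'Age': mean_age})
print(filled)

# Option 2: drop rows where 'Age' is missing
dropped = deduplicated.dropna(subset=['Age'])
print(dropped)
```

Which option is appropriate depends on the analysis; filling with 0, as in the cell above, would pull an average age downward.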


## 3. Advanced Level

### 3.1 Grouping and Aggregation
Pandas can group rows by the values in a column and then compute an aggregate, such as the sum or mean, for each group.

In [None]:
# Grouping data by a column and calculating the mean
grouped = df.groupby('City')['Age'].mean()
print('Mean age by city:\n', grouped)

Mean age by city:
 City
Chicago        22.0
Houston        32.0
Los Angeles    27.0
New York       24.0
Name: Age, dtype: float64
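
Because every city appears only once in the table above, each group mean is just that person's age. The sketch below uses hypothetical data with repeated cities so the aggregation has something to combine, and computes several statistics at once with `agg()`.

```python
import pandas as pd

# Hypothetical data in which each city appears more than once
people = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'Houston', 'Houston', 'Houston'],
    'Age': [22, 30, 32, 40, 24],
})

# agg() accepts a list of aggregation names and returns one column per statistic
summary = people.groupby('City')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```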


### 3.2 Merging and Joining DataFrames
Merging DataFrames is a powerful feature that allows combining data from multiple sources.

In [None]:
# Creating another DataFrame
data2 = {
    'Name': ['Alice', 'Bob', 'Eve'],
    'Salary': [50000, 60000, 75000]
}
df2 = pd.DataFrame(data2)

# Merging DataFrames
merged_df = pd.merge(df, df2, on='Name', how='left')
print('Merged DataFrame:\n', merged_df)

Merged DataFrame:
       Name  Age         City   Salary
0    Alice   24     New York  50000.0
1      Bob   27  Los Angeles  60000.0
2  Charlie   22      Chicago      NaN
3    David   32      Houston      NaN
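
The `how` argument controls which rows survive the merge. With `how='left'`, every row of the first DataFrame is kept, which is why Charlie and David appear with `NaN` salaries and Eve is absent. The sketch below, using the hypothetical names `left` and `right` for the two tables, contrasts this with inner and outer joins.

```python
import pandas as pd

left = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
})
right = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve'],
    'Salary': [50000, 60000, 75000],
})

# Inner join: only names present in both tables
inner = pd.merge(left, right, on='Name', how='inner')
print(inner)

# Outer join: all names from either table; indicator=True adds a '_merge'
# column recording where each row came from
outer = pd.merge(left, right, on='Name', how='outer', indicator=True)
print(outer)
```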


### 3.3 Advanced Data Transformation
Pandas allows applying custom functions to columns for transforming data.

In [None]:
# Applying a custom function to transform a column
df['Age_Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Mature')
print('Transformed DataFrame:\n', df)

Transformed DataFrame:
       Name  Age         City Age_Category
0    Alice   24     New York        Young
1      Bob   27  Los Angeles        Young
2  Charlie   22      Chicago        Young
3    David   32      Houston       Mature


The cell above uses a `lambda` function with the `apply()` method to derive a new DataFrame column from an existing one based on a custom condition.

Here’s a step-by-step explanation:

### Code Breakdown:

```python
df['Age_Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Mature')
```

1. **`df['Age']`**:
   This selects the `Age` column from the DataFrame `df`. The values in this column will be passed to the custom function defined in the next step.

2. **`.apply(lambda x: ...)`**:
   The `apply()` method is used to apply a function to each element of the Series (or column) it's called on. Here, a `lambda` function is used as the custom function.

3. **`lambda x: 'Young' if x < 30 else 'Mature'`**:
   - **`lambda x:`** defines an anonymous function (a function without a name). In this case, the function takes a single argument `x`, which represents each element in the `Age` column.
   - **`'Young' if x < 30 else 'Mature'`** is a conditional expression. It checks if `x` (the age) is less than 30:
     - If `x` is less than 30, it returns the string `'Young'`.
     - Otherwise, it returns `'Mature'`.

4. **`df['Age_Category'] = ...`**:
   This assigns the result of the `apply()` call to a new column in the DataFrame called `Age_Category`. Each element in this new column will be either `'Young'` or `'Mature'` based on the condition applied to the values in the `Age` column.

### Example:

Given the DataFrame:

```plaintext
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
4      Eve   22      Phoenix
```

After applying the transformation:

```plaintext
      Name  Age         City Age_Category
0    Alice   25     New York        Young
1      Bob   30  Los Angeles       Mature
2  Charlie   35      Chicago       Mature
3    David   40      Houston       Mature
4      Eve   22      Phoenix        Young
```

### Key Points:
- **Lambda functions** are small, anonymous functions, useful for simple, one-time operations like this.
- **`apply()` method** applies a function to every element when called on a column (a Series); called on a whole DataFrame, it passes each column (or each row, with `axis=1`) to the function. Both forms are handy for data transformations.
- The result is stored in a new column (`Age_Category`), which can then be used for further analysis.
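
For a simple two-way condition like this one, a vectorized alternative is possible. The sketch below is not the notebook's method: it swaps in NumPy's `np.where`, applied to the example data above under the hypothetical name `people`, and checks that it gives the same labels as `apply()`.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the example table above
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 22],
})

# Element-wise version, as in the notebook
via_apply = people['Age'].apply(lambda x: 'Young' if x < 30 else 'Mature')

# Vectorized version: np.where evaluates the condition on the whole column at once
people['Age_Category'] = np.where(people['Age'] < 30, 'Young', 'Mature')

# Check that both approaches produce the same labels
print((people['Age_Category'] == via_apply).all())
print(people)
```

On large columns the vectorized form is typically faster, while `apply()` remains the more flexible choice for logic that does not reduce to a single condition.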


## Conclusion
Pandas is a versatile library for data manipulation and analysis, making it a must-know tool for data science.