# 1. What is Pandas

Pandas is a Python library used for working with data in table format (like Excel spreadsheets). It makes it easy to read, clean, analyze, and manipulate structured data.

## Main Uses
- Loading data from files (CSV, Excel, databases)
- Cleaning messy data (handling missing values, duplicates)
- Analyzing and exploring datasets
- Transforming and reshaping data
- Preparing data for machine learning or visualization

# 2. Importing the Library
- `import pandas as pd` loads the library
- The alias `pd` is the conventional shortcut, so every pandas call is written as `pd.<name>`

In [1]:
import pandas as pd

# 3. Pandas Methods

## 3.1 Pandas Series

- A DataFrame is a table with rows and columns
- Each column, together with the values under it, is a **Series**
- A Series is a one-dimensional labeled array (like a single column in Excel)
- **Multiple Series can be combined to form a DataFrame**
- It is similar to a Python list, but each value has a label (the index) and the structure is optimized for data analysis
- An item is selected from a Series by its index label
- The default index is 0, 1, 2, ..., but custom labels can be supplied instead

In [None]:
# Pandas series (List -> Series)

a1 = [1, 7, 2]

# Create a Series from the list. When printed, the left column is the index and the right column is the data
s1 = pd.Series(a1)
# Optional: supply custom index labels instead of the default 0, 1, 2
s1 = pd.Series(a1, index=["a", "b", "c"])
print("a1(List):")
print(a1)
print("\n")
print("s1(Series):")
print(s1)

# Example use: student names come from one source and their grades from another.
# Each becomes a Series, and the two are combined into a DataFrame

# Fetch an item from the Series by its index label
print("\ns1['a']:")
print(s1["a"])


a1(List):
[1, 7, 2]


s1(Series):
a    1
b    7
c    2
dtype: int64

s1['a']:
1


In [20]:
# Dictionary -> Series

# The dictionary keys become the index labels and the dictionary values become the data

calories = {"Day 1": 420, "Day 2": 380, "Day 3": 390}

s2 = pd.Series(calories)

print("S2:")
print(s2)

print("\nS2[Day 1]")
print(s2["Day 1"])



S2:
Day 1    420
Day 2    380
Day 3    390
dtype: int64

S2[Day 1]
420
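
The idea mentioned earlier, combining several Series into one DataFrame, can be sketched as follows. The student names and grades are made-up sample values, and `names`, `grades`, and `students` are illustrative variable names rather than anything from the notebook.

```python
import pandas as pd

# Hypothetical data: names and grades arriving from two separate sources
names = pd.Series(["Alice", "Bob", "Charlie"])
grades = pd.Series([88, 92, 79])

# Each dictionary key becomes a column name; rows are aligned on the shared index
students = pd.DataFrame({"name": names, "grade": grades})
print(students)

# Each column of the result is itself a Series
print(type(students["grade"]).__name__)
```

Because the two Series share the default index 0, 1, 2, each name lines up with the grade at the same position.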


## 3.2 Pandas DataFrame
- The **pd.DataFrame()** constructor creates a DataFrame (a table) from your data.
- What it does:
    - Takes data in various formats (dictionary, list, array, etc.)
    - Converts it into a structured table with rows and columns
    - Takes column names from the data (for example, dictionary keys) and assigns a default row index (0, 1, 2, ...) unless you provide one

In [None]:
# Build a dictionary holding the data
# {key: [values...]}: each key is a column name and its list holds that column's values in row order
# The DataFrame assigns its own default index (0, 1, 2, ...)

data1 = {
    "calories": [420, 380, 390],
    "durations": [10, 30, 50]
}


df1 = pd.DataFrame(data1)

print(df1)

   calories  durations
0       420         10
1       380         30
2       390         50
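
As a complement to the dictionary-of-lists form above, here is a minimal sketch of building the same table row by row from a list of dictionaries, with custom row labels. The labels `day1` to `day3` and the name `df_rows` are assumptions made for illustration.

```python
import pandas as pd

# Each dictionary is one row; its keys become the column names
rows = [
    {"calories": 420, "durations": 10},
    {"calories": 380, "durations": 30},
    {"calories": 390, "durations": 50},
]

# index= replaces the default 0, 1, 2 labels (the day labels are made up)
df_rows = pd.DataFrame(rows, index=["day1", "day2", "day3"])
print(df_rows)
```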


### 3.2.1 Accessing Specific Columns In DataFrame

- use **df['column_name']** to return a single column as a Series
- use **df[['column1', 'column2']]** to return multiple columns as a DataFrame (note the double brackets)

In [30]:

# Print single column (returns Series)
print("Single column - calories:")
print(df1['calories'])
print()

# Print multiple columns (returns DataFrame)
print("Multiple columns - calories and durations:")
print(df1[['calories', 'durations']])
print()


Single column - calories:
0    420
1    380
2    390
Name: calories, dtype: int64

Multiple columns - calories and durations:
   calories  durations
0       420         10
1       380         30
2       390         50
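
The difference between single and double brackets is easiest to see by checking the type and shape of each result. This sketch rebuilds `df1` so it runs on its own; `single` and `as_frame` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"calories": [420, 380, 390], "durations": [10, 30, 50]})

single = df1["calories"]      # single brackets -> Series
as_frame = df1[["calories"]]  # double brackets -> DataFrame, even for one column

print(type(single).__name__, single.shape)      # Series (3,)
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (3, 1)
```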



### 3.2.2 Finding Rows In Data Frame

1. use **loc[index]** to return a single row, selected by its index label
2. use **loc[[list_of_indexes]]** to return multiple rows, where list_of_indexes holds the index labels of the rows you want
3. use **loc[start_index:end_index]** to return a range of rows from start to end (inclusive, unlike regular Python slicing). Example: `df1.loc[0:2]` returns rows 0 through 2
4. use **loc[condition]** to filter rows based on a condition. Example: `df1.loc[df1['calories'] > 390]` returns all rows where calories is greater than 390
5. use **df.loc[:, 'column_name']** to return all rows for a specific column
6. use **df.loc[:, ['column1', 'column2']]** to return all rows for multiple columns
7. use **df.loc[rows, columns]** to combine row and column selection,
    - where rows is one of:
        - a `list` of index labels for multiple rows
        - a range of rows written as a slice `start:end`
        - a single index label (a `number` when the default index is used)
    - and columns is one of:
        - a `list` of column names (list of strings) for multiple columns
        - a single column name (`string`) for one column

In [35]:
# Printing Single Row by its index
print("Single Row (index 1):")
print(df1.loc[1])
print()

# Printing Multiple Rows by their corresponding indices
print("Multiple Rows (index 0 and 2):")
print(df1.loc[[0, 2]])
print()

# Printing Multiple rows in a specific Range 
print("Range of Rows (0 through 2):")
print(df1.loc[0:2])
print()

# Printing Rows After applying filtration
print("Rows where calories > 390:")
print(df1.loc[df1['calories'] > 390])
print()

print("Rows where durations == 30:")
print(df1.loc[df1['durations'] == 30])
print()

print("Rows where calories > 380 AND durations < 40:")
print(df1.loc[(df1['calories'] > 380) & (df1['durations'] < 40)])
print()

# Using loc for single column (all rows)
print("All rows, calories column:")
print(df1.loc[:, 'calories'])
print()

# Using loc for multiple columns (all rows)
print("All rows, both columns:")
print(df1.loc[:, ['calories', 'durations']])
print()

# Combining row and column selection
print("Rows 0-1, calories column:")
print(df1.loc[0:1, 'calories'])
print()

# Combining row and column selection
print("Row 0, calories column:")
print(df1.loc[0, 'calories'])
print()

print("Rows 0 and 2, both columns:")
print(df1.loc[[0, 2], ['calories', 'durations']])

Single Row (index 1):
calories     380
durations     30
Name: 1, dtype: int64

Multiple Rows (index 0 and 2):
   calories  durations
0       420         10
2       390         50

Range of Rows (0 through 2):
   calories  durations
0       420         10
1       380         30
2       390         50

Rows where calories > 390:
   calories  durations
0       420         10

Rows where durations == 30:
   calories  durations
1       380         30

Rows where calories > 380 AND durations < 40:
   calories  durations
0       420         10

All rows, calories column:
0    420
1    380
2    390
Name: calories, dtype: int64

All rows, both columns:
   calories  durations
0       420         10
1       380         30
2       390         50

Rows 0-1, calories column:
0    420
1    380
Name: calories, dtype: int64

Row 0, calories column:
420

Rows 0 and 2, both columns:
   calories  durations
0       420         10
2       390         50
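
Since `loc` selects by label rather than by position, the same rules apply when the index is not numeric. The sketch below uses made-up string labels (`mon`, `tue`, `wed`) to show that a label slice includes its end point; `df_days`, `first_two`, and `long_sessions` are illustrative names.

```python
import pandas as pd

# Same data as df1, but with string row labels (the labels are made up)
df_days = pd.DataFrame(
    {"calories": [420, 380, 390], "durations": [10, 30, 50]},
    index=["mon", "tue", "wed"],
)

# Label slice: both "mon" and "tue" are included
first_two = df_days.loc["mon":"tue"]
print(first_two)

# A plain Python list slice, by contrast, excludes the end position
print([420, 380, 390][0:1])

# Boolean condition for the rows combined with a column selection
long_sessions = df_days.loc[df_days["durations"] >= 30, "calories"]
print(long_sessions)
```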


### 3.2.3 Loading A CSV File Into DataFrame & DF Options

- **CSV** (Comma-Separated Values) is a file format used to store tabular data in plain text, where each line represents a row and values are separated by commas
- use **pd.read_csv('filename.csv')** to load a CSV file into a DataFrame
- By default, pandas uses the first row of the CSV file as the column names
- You can pass options such as the delimiter, the encoding, or which rows to skip when loading
- Common DataFrame methods
    - **df.head(k)** - Returns the first k rows of the DataFrame (default is 5)
    - **df.tail(k)** - Returns the last k rows of the DataFrame (default is 5)
    - **df.info()** - Prints a quick summary of the DataFrame including number of entries, column data types, and non-null counts (it prints directly and returns `None`, so it does not need `print()`)
    - **df.describe()** - Returns summary statistics for the numerical columns of the DataFrame
        - **count** - Number of non-null (non-missing) values in each column
        - **mean** - Average value of all entries in the column
        - **std (standard deviation)** - Measures how spread out the values are around the mean. Low std = values close to the mean, high std = values spread far from the mean. A rough rule of thumb (not a formal standard, and only meaningful when the mean is not close to zero) compares std to the mean:
            - **Low**: std below about 10-20% of the mean
            - **Medium**: std about 20-50% of the mean
            - **High**: std above about 50% of the mean
        - **min** - Minimum (smallest) value in the column
        - **25% (first quartile)** - Value below which 25% of the data falls
        - **50% (median/second quartile)** - Middle value when data is sorted; 50% of values are below this
        - **75% (third quartile)** - Value below which 75% of the data falls
        - **max** - Maximum (largest) value in the column
    - **print(df)** - Displays the DataFrame (long or wide DataFrames are truncated in the printed output)
    - **df.dropna()** - Removes every row that contains at least one null/missing value (NaN). Returns a new DataFrame; the original is unchanged unless you assign the result back
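
The methods listed above can be tried without a file on disk. The sketch below feeds a small made-up CSV string to `pd.read_csv` through `io.StringIO`; the column names, values, and variable names (`csv_text`, `students`, `stats`, `clean`) are assumptions for illustration, not the contents of `students.csv`.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk; one GPA value is missing
csv_text = """Name,Age,GPA
Alice,20,3.8
Bob,22,
Charlie,19,3.9
Diana,21,3.7
"""

# read_csv accepts a file path or any file-like object
students = pd.read_csv(io.StringIO(csv_text))

print(students.head(2))       # first 2 rows
students.info()               # prints its summary directly and returns None
stats = students.describe()   # statistics for the numeric columns
print(stats)

# count ignores missing values, so GPA has one fewer than Age
print(stats.loc["count", "Age"], stats.loc["count", "GPA"])

# Ratio of std to mean, the quantity behind the low/medium/high rule of thumb
print(stats.loc["std"] / stats.loc["mean"])

# dropna returns a new DataFrame; students itself is unchanged
clean = students.dropna()
print(len(students), len(clean))
```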

In [None]:
# Basic CSV loading
df = pd.read_csv('../../Data/students.csv')

# View the entire dataframe
print("ENTIRE DATAFRAME")
print(df)
print()

# View the first few rows (5 by default)
print("TOP 5 ROWS OF DATAFRAME")
print(df.head()) # To return the top k rows instead: df.head(k)
print()

# View the last few rows (5 by default)
print("LAST 5 ROWS OF DATAFRAME")
print(df.tail()) # To return the last k rows instead: df.tail(k)
print()

# View basic info about the DataFrame
print("INFO OF DATAFRAME")
df.info() # Prints a quick summary: number of entries, data type of each column, and non-null counts (info() returns None, so no print() is needed)
print()

# View statistics of the DataFrame
print("STATISTICS OF DATAFRAME")
print(df.describe()) # Returns summary statistics for the numeric columns; count = number of non-null values
print()

# Drop rows that contain null values
print("DROP NULL FROM DATAFRAME")
df = df.dropna() # dropna() returns a new DataFrame, so assign it back


# Common options
# df = pd.read_csv('../../Data/students.csv', 
#                  index_col=0,           # Use first column as index
#                  sep=';',               # Custom separator (for semicolon-separated files)
#                  encoding='utf-8',      # Specify encoding
#                  header=0,              # Row number to use as column names
#                  skiprows=[1, 2],       # Skip specific rows
#                  nrows=100)             # Read only first 100 rows

ENTIRE DATAFRAME
             Name  Age      Grade           City  GPA
0   Alice Johnson   20  Sophomore       New York  3.8
1       Bob Smith   22     Senior    Los Angeles  3.5
2   Charlie Brown   19   Freshman        Chicago  3.9
3    Diana Prince   21     Junior        Houston  3.7
4       Eve Davis   20  Sophomore        Phoenix  3.6
5    Frank Miller   23     Senior   Philadelphia  3.4
6       Grace Lee   19   Freshman    San Antonio  3.9
7    Henry Wilson   22     Junior      San Diego  3.5
8     Iris Taylor   20  Sophomore         Dallas  3.7
9      Jack Moore   21     Junior         Austin  3.8
10    Kevin White   19   Freshman   Jacksonville  3.6
11   Laura Harris   22     Senior     Fort Worth  3.8
12     Mike Clark   20  Sophomore       Columbus  3.5
13    Nancy Lewis   21     Junior      Charlotte  3.9
14   Oscar Walker   23     Senior  San Francisco  3.7

TOP 5 ROWS OF DATAFRAME
            Name  Age      Grade         City  GPA
0  Alice Johnson   20  Sophomore     New Yo

### 3.2.4 Adding Columns To DataFrame with Lambda Expression

- To create a new column based on another column, use **df['new_column'] = df['existing_column'].apply(function)**
- **apply()** calls a function on each value in the column; this is usually a lambda function, but a regular named function works too
- **lambda** is an anonymous (unnamed) function used for simple, one-time operations
    - A regular function has a name and is defined separately, while a lambda is defined inline (on the spot)
    - It's a shorthand way to write small functions without using `def`
    - Syntax: `lambda parameters: expression` (no `return` keyword needed; the value of the expression is returned automatically)
    - Used when you need a simple function only once and don't want to define it separately
- The lambda function takes each value from the column, performs an operation, and returns the result
- **Lambda limitations:**
    - Must be a **single expression** that returns a value
    - Cannot contain statements, so multi-step logic needs a regular function
    - Cannot use `if/elif/else` statements (but can use the conditional expression `a if condition else b`)
    - Cannot use `for`/`while` loops, `return`, `pass`, `raise`, etc.
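
To make the contrast between a named function and a lambda concrete, this sketch applies both to the same Series and checks that the results match. The calorie values and the names `categorize`, `by_def`, and `by_lambda` are made up for illustration.

```python
import pandas as pd

calories = pd.Series([420, 380, 250])

# Named function: free to use if/elif/else statements
def categorize(cal):
    if cal > 400:
        return "High"
    elif cal > 300:
        return "Medium"
    return "Low"

by_def = calories.apply(categorize)

# Equivalent lambda: one expression built from nested conditional expressions
by_lambda = calories.apply(
    lambda cal: "High" if cal > 400 else ("Medium" if cal > 300 else "Low")
)

print(by_def.tolist())           # ['High', 'Medium', 'Low']
print(by_lambda.equals(by_def))  # True
```

Once the logic needs more than two or three branches, the named function tends to be easier to read than a nested lambda.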


In [None]:
# Create a new column 'calorieLevel' based on 'calories'
# (uses df1 from section 3.2; the students DataFrame df has no calories column)
df1['calorieLevel'] = df1['calories'].apply(
    lambda x: 'High' if x > 400 else 'Low'
)

# Three levels using a nested conditional expression
df1['calorieLevel'] = df1['calories'].apply(
    lambda x: 'High' if x > 400 else ('Medium' if x > 300 else 'Low')
)

# Using a regular function (alternative to lambda)
def categorize_calories(cal):
    if cal > 400:
        return 'High'
    elif cal > 300:
        return 'Medium'
    else:
        return 'Low'

# df1['calorieLevel'] = df1['calories'].apply(categorize_calories)


# # Other apply examples
# df1['doubled'] = df1['calories'].apply(lambda x: x * 2)
# df['rounded'] = df['GPA'].apply(lambda x: round(x, 1))
# df['upperName'] = df['Name'].apply(lambda x: x.upper())


In [None]:
# apply() can also run a function on entire rows, which gives access to several columns at once
# DataFrame.apply defaults to axis=0, which passes each column to the function
# (Series.apply, used above on a single column, has no axis to specify)
# axis=1 passes each row to the function instead (row-wise)
# The lambda receives the whole row, so it can read multiple columns
# Example: create a BMI column from height and weight
# (assumes df has 'Weight' in kg and 'Height' in m; the students data shown earlier does not)
df['BMI'] = df.apply(lambda row: row['Weight'] / (row['Height'] ** 2), axis=1)
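
Here is a self-contained sketch of the row-wise pattern above. The `people` DataFrame and its measurements are made up, with weight assumed to be in kilograms and height in metres.

```python
import pandas as pd

# Made-up measurements: weight in kilograms, height in metres
people = pd.DataFrame({
    "Name": ["Ann", "Ben"],
    "Weight": [60.0, 81.0],
    "Height": [1.60, 1.80],
})

# axis=1: the lambda receives one row at a time as a Series
people["BMI"] = people.apply(
    lambda row: row["Weight"] / (row["Height"] ** 2), axis=1
)

# The same arithmetic written on whole columns gives the same values without apply
bmi_columns = people["Weight"] / people["Height"] ** 2

print(people)
```

For plain arithmetic like this, the column-wise form is usually shorter and faster; row-wise `apply` is more useful when the logic per row cannot be written as simple column operations.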

### 3.2.5 Replacing Values In Columns In DataFrame

- use **df['column'].replace(old_value, new_value)** to replace specific values in a column
- use **df.replace(old_value, new_value)** to replace values across the entire DataFrame
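- **inplace=True** modifies the object it is called on instead of returning a new one. Called on a selected column (`df['column'].replace(..., inplace=True)`), it may not update the DataFrame in recent pandas versions, so assigning the result back is the safer pattern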
- Can replace single values, multiple values, or use a dictionary for mapping

In [None]:
import numpy as np

# Column names such as 'Status' are placeholders; substitute columns from your own DataFrame

# Replace a single value in a column
df['Grade'] = df['Grade'].replace('A+', 'A')

# Replace multiple values in a column
df['Status'] = df['Status'].replace(['pending', 'waiting'], 'in_progress')

# Replace using a dictionary (old: new)
df['Grade'] = df['Grade'].replace({'A+': 'A', 'B+': 'B', 'C+': 'C'})

# Replace across entire DataFrame (turn the text 'N/A' into a real missing value)
df = df.replace('N/A', np.nan)

# Replace and assign back (inplace=True on a selected column may not update df in recent pandas versions)
df['City'] = df['City'].replace('NYC', 'New York')

# Replace NaN/null values with a specific value
df['Age'] = df['Age'].fillna(0)  # Fill missing values with 0

# Replace multiple values across entire DataFrame
df = df.replace({'Yes': 1, 'No': 0})
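
The replacement patterns above can be checked on a tiny made-up DataFrame. In this sketch `survey` and its values are hypothetical; it also shows that the text `"N/A"` is not treated as missing until it is replaced with a real NaN.

```python
import numpy as np
import pandas as pd

# Made-up data; "N/A" is a text placeholder, not a real missing value
survey = pd.DataFrame({
    "Grade": ["A+", "B+", "A"],
    "City": ["NYC", "Boston", "N/A"],
})

# Dictionary mapping applied to one column
survey["Grade"] = survey["Grade"].replace({"A+": "A", "B+": "B"})

# Single value replaced in one column
survey["City"] = survey["City"].replace("NYC", "New York")

# The text "N/A" is not counted as missing until it is replaced with NaN
missing_before = int(survey["City"].isna().sum())
survey = survey.replace("N/A", np.nan)
missing_after = int(survey["City"].isna().sum())

print(survey)
print(missing_before, missing_after)
```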


### 3.2.6 Filling N/A Values In DataFrame Columns
- **Single column syntax:** use **df['column'].fillna(value)** to replace null/missing (NaN) values with a specific value in one column
- **Multiple columns syntax:** use **df.fillna({'column1': value1, 'column2': value2})** with a dictionary to fill multiple columns at once with different values
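- Common fill strategies: fill with the mean, median, mode, or a constant value, or carry neighbouring values forward (**df.ffill()**) or backward (**df.bfill()**)
- **inplace=True** modifies the DataFrame it is called on; otherwise a new DataFrame is returned and must be assigned back. On a selected column (`df['column'].fillna(..., inplace=True)`) it may not update the DataFrame in recent pandas versions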

In [None]:
# Replace null/missing values with the mean of the column
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Replace null values with the median (alternative)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Replace null values with mode (most frequent value)
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Using inplace=True on the DataFrame itself (modifies it directly, no assignment needed)
# On a selected column such as df['Age'] it may not update df in recent pandas versions
df.fillna({'Age': df['Age'].mean()}, inplace=True)

# Fill multiple columns at once using a dictionary
df.fillna({
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'City': df['City'].mode()[0]
}, inplace=True)

# Or without inplace (assign back)
df = df.fillna({
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'City': 'Unknown'
})

# Fill all null values in the entire DataFrame with a single value (pick one; after the first call no NaN remains)
df.fillna(0, inplace=True)  # Replace all NaN with 0
# df.fillna('N/A', inplace=True)  # Alternative: replace all NaN with 'N/A'

# Fill forward (use previous row's value); fillna(method='ffill') is deprecated in recent pandas versions
df = df.ffill()

# Fill backward (use next row's value); fillna(method='bfill') is deprecated in recent pandas versions
df = df.bfill()
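
The fill strategies above, sketched on small made-up data so the results can be checked by eye. `staff`, `filled`, and `readings` are illustrative names.

```python
import pandas as pd

# Made-up data with gaps
staff = pd.DataFrame({
    "Age": [30.0, None, 50.0],
    "City": ["Cairo", None, "Cairo"],
})

# A different fill value per column via a dictionary; the result is a new DataFrame
filled = staff.fillna({
    "Age": staff["Age"].mean(),       # mean of 30 and 50; the missing value is ignored
    "City": staff["City"].mode()[0],  # most frequent non-missing value
})
print(filled)

# Forward and backward fill on a Series with a gap in the middle
readings = pd.Series([1.0, None, 3.0])
print(readings.ffill().tolist())  # [1.0, 1.0, 3.0]
print(readings.bfill().tolist())  # [1.0, 3.0, 3.0]
```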

### 3.2.7 Drop Columns In The DataFrame
- **Single column syntax:** use **df.drop(columns='column_name')** to remove one column from the DataFrame
- **Multiple columns syntax:** use **df.drop(columns=['column1', 'column2'])** to remove multiple columns at once using a list
- **inplace=True** modifies the original DataFrame, otherwise it returns a new DataFrame and requires assignment
- Alternative: use **del df['column_name']** to delete a column in-place (cannot be undone)

In [None]:
# The snippets below are alternatives, and the column names are placeholders
# Drop a single column
df = df.drop(columns='column_name')

# Drop multiple columns
df = df.drop(columns=['column1', 'column2', 'column3'])

# Using inplace=True (modifies original DataFrame, no assignment needed)
df.drop(columns='column_name', inplace=True)
df.drop(columns=['column1', 'column2'], inplace=True)

# Alternative: Using del (in-place, cannot be undone)
del df['column_name']

# Using pop() (removes and returns the column)
removed_column = df.pop('column_name')
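
A runnable sketch of the difference between `drop` (returns a new DataFrame) and `pop` (removes the column and returns it). The `scores` DataFrame and its columns are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["Ann", "Ben"],
    "math": [90, 75],
    "art": [82, 88],
    "notes": ["", ""],
})

# drop returns a new DataFrame; scores itself still has all four columns
trimmed = scores.drop(columns=["notes", "art"])
print(list(trimmed.columns))  # ['name', 'math']
print(list(scores.columns))   # ['name', 'math', 'art', 'notes']

# pop removes the column from scores and hands it back as a Series
math_scores = scores.pop("math")
print(math_scores.tolist())   # [90, 75]
print(list(scores.columns))   # ['name', 'art', 'notes']
```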