# 6. Advanced Data Analysis
- Advanced data analysis goes beyond basic summaries to extract deeper insights from your data. This section covers window functions, data aggregation, and performance optimization, which matter most when datasets grow large.

## Window Functions


- Window functions apply a calculation across a window of consecutive data points instead of the whole column at once. They are useful for smoothing time series data, calculating rolling averages, and understanding trends over specified intervals. pandas provides two main kinds: rolling and expanding calculations.

  - `Rolling Window Calculations:` Apply a function to a fixed-size window of data that moves across the dataset. Common functions include mean, sum, standard deviation, and more.

  - `Expanding Window Calculations:` Apply a function to an expanding window of data that grows as it progresses through the dataset. This is useful for cumulative calculations.
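To make the difference between the two window types concrete, here is a minimal sketch that compares pandas' results with the same averages worked out by hand. The series values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
s = pd.Series([10, 20, 30, 40, 50])

# Rolling window of size 3: each result averages the current value and the two before it
rolling_mean = s.rolling(window=3).mean()

# Expanding window: each result averages everything from the start up to the current value
expanding_mean = s.expanding().mean()

print(pd.DataFrame({'value': s, 'rolling_3': rolling_mean, 'expanding': expanding_mean}))

# The same numbers by hand for the last position (index 4)
manual_rolling = (30 + 40 + 50) / 3                # only the last three values
manual_expanding = (10 + 20 + 30 + 40 + 50) / 5    # every value seen so far
print(manual_rolling, manual_expanding)
```

The first two rolling results are NaN because a window of three values is not yet available at those positions, while the expanding result is defined from the first row onward.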


### Types of Window Functions:

- `Rolling Mean:` Computes the average over a moving window.
- `Rolling Sum:` Computes the sum over a moving window.
- `Rolling Standard Deviation:` Measures variability over a moving window.
- `Expanding Mean:` Computes the cumulative mean from the start to the current point.

In [None]:
import pandas as pd

# Sample data with daily frequency
data = {
    'Date': pd.date_range(start='2024-01-01', periods=10, freq='D'),
    'Value': [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Rolling mean with a window size of 3 (the first two rows are NaN until the window is full)
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
print(df)

# Expanding mean
df['Expanding_Mean'] = df['Value'].expanding().mean()
print(df)
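The example above covers the rolling mean and the expanding mean. The rolling sum and rolling standard deviation from the list of window function types follow the same pattern, as the sketch below suggests. It reuses the same sample values; the column names and the use of `min_periods=1` to avoid leading NaN values are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same sample values as the example above, rebuilt so this block runs on its own
dates = pd.date_range(start='2024-01-01', periods=10, freq='D')
values = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 180, 190], index=dates)

# One rolling object can be reused for several statistics
window = values.rolling(window=3)

stats = pd.DataFrame({
    'Value': values,
    'Rolling_Sum': window.sum(),
    'Rolling_Std': window.std(),  # sample standard deviation (ddof=1 by default)
    # min_periods=1 returns a result as soon as one value is available,
    # so the first rows average a partial window instead of being NaN
    'Rolling_Mean_Partial': values.rolling(window=3, min_periods=1).mean(),
})
print(stats)
```

Whether partial windows are acceptable depends on the analysis: they remove the leading NaN values, but the first results are based on fewer observations than the rest.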


## Data Aggregation


- Data aggregation involves summarizing data by applying functions like sum, mean, min, max, or custom functions to groups of data. This is particularly useful in exploratory data analysis to understand trends and patterns. Aggregation functions can be used in conjunction with groupby() to analyze data across different categories.

  - `Using agg():` Applies one or more aggregation functions to one or more columns and returns one row per group.
  - `Using transform():` Applies a function to each group and returns a result with the same index and length as the input, so it can be assigned back to the original DataFrame as a new column.
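The difference in output shape is easiest to see side by side. The following is a small sketch with hypothetical data: `agg()` collapses each group to a single value, while `transform()` repeats the group value on every row of that group.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'City': ['Bangalore', 'Bangalore', 'Chennai', 'Chennai', 'Chennai'],
    'Age': [25, 35, 30, 40, 50],
})

# agg() reduces each group to one value: the result has one entry per city
city_mean = df.groupby('City')['Age'].agg('mean')
print(city_mean)

# transform() keeps the original length: each row receives its own group's mean
df['City_Mean_Age'] = df.groupby('City')['Age'].transform('mean')

# Because the lengths match, row-level comparisons against the group are direct
df['Age_Diff'] = df['Age'] - df['City_Mean_Age']
print(df)
```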


### Types of Aggregation Functions:

- `Sum:` Total sum of values within each group.
- `Mean:` Average value within each group.
- `Min/Max:` Minimum or maximum value within each group.
- `Custom Functions:` User-defined aggregation functions.




Example:

In [None]:
import pandas as pd

# Sample data
data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi', 'Bhagath', 'Monika'],
    'Age': [25, 30, 35, 28, 25, 40],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur', 'Bangalore', 'Hyderabad']
}
df = pd.DataFrame(data)

# Aggregation with groupby
agg_result = df.groupby('City').agg({
    'Age': ['mean', 'max', 'min'],
    'Name': 'count'
})
print(agg_result)

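# Using transform to standardize Age within each city (z-score).
# Cities with a single row, or with identical ages, have no spread:
# x.std() is NaN or 0 there, so the result for those rows is NaN.
df['Normalized_Age'] = df.groupby('City')['Age'].transform(lambda x: (x - x.mean()) / x.std())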
print(df)
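Custom functions, listed among the aggregation types above, can be passed to `agg()` just like the built-in names. The sketch below uses the same sample data; the `age_range` helper and the output column names are hypothetical, and the keyword form (named aggregation) is one of several ways to call `agg()`.

```python
import pandas as pd

# Same sample data as the example above, rebuilt so this block runs on its own
df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi', 'Bhagath', 'Monika'],
    'Age': [25, 30, 35, 28, 25, 40],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur', 'Bangalore', 'Hyderabad'],
})

def age_range(ages):
    """Hypothetical custom aggregation: gap between the oldest and youngest age."""
    return ages.max() - ages.min()

# Named aggregation: output_column=(input_column, function)
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    age_range=('Age', age_range),
    unique_names=('Name', 'nunique'),  # distinct names, unlike 'count', which counts rows
)
print(summary)
```

Named aggregation yields flat, readable column names, whereas the dictionary form used in the earlier example produces a two-level column index.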


## Performance Optimization


- Performance optimization focuses on making data processing faster and reducing memory usage, both of which become critical with large datasets. Efficient data handling shortens execution time and can significantly reduce computational costs.

  - `Efficient Data Handling:` Involves techniques like chunking, indexing, and optimizing data types. Chunking allows for processing large datasets in smaller segments. Indexing speeds up data retrieval operations. Choosing appropriate data types reduces memory consumption.
  - `Memory Management:` Involves techniques like type conversion (e.g., converting int64 to int32), dropping unnecessary columns, and using more memory-efficient data structures.
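One way to check whether a type conversion actually helps is to measure memory before and after. The sketch below is a minimal illustration on randomly generated, hypothetical data; the row count, seed, and the choice of `pd.to_numeric(..., downcast='integer')` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, generated with a fixed seed for reproducibility
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'Age': rng.integers(20, 60, size=n),  # int64 by default
    'City': rng.choice(['Bangalore', 'Chennai', 'Hyderabad'], size=n),
})

# deep=True also counts the memory held by the string values themselves
before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# downcast='integer' picks the smallest signed integer type that fits the values
optimized['Age'] = pd.to_numeric(optimized['Age'], downcast='integer')
# A column with few distinct values stores compact integer codes as a category
optimized['City'] = optimized['City'].astype('category')

after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"Before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

The savings depend on the data: downcasting is only safe when all current and future values fit the smaller type, and `category` helps most when the number of distinct values is small relative to the number of rows.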


Example:

In [None]:
import pandas as pd
import numpy as np

# Create a large dataset
data = {
    'Name': np.random.choice(['Bhagath', 'Bharath', 'Monika', 'Padhmavathi'], size=1_000_000),
    'Age': np.random.randint(20, 60, size=1_000_000),
    'City': np.random.choice(['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur'], size=1_000_000)
}
df = pd.DataFrame(data)

# Optimize memory by converting data types
df['Age'] = df['Age'].astype(np.int32)
df['City'] = df['City'].astype('category')

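# Write the sample data to disk so the chunked read below has a file to work on
# (in practice 'large_dataset.csv' would be an existing file too large to load at once)
df.to_csv('large_dataset.csv', index=False)

# Efficient chunk processing: read 100,000 rows at a time and combine partial results
chunk_size = 100_000
total_rows = 0
age_sum = 0
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Replace these two lines with the actual per-chunk processing
    total_rows += len(chunk)
    age_sum += chunk['Age'].sum()
print(f"Mean age over {total_rows} rows: {age_sum / total_rows:.2f}")

# info() prints its report directly and returns None, so it is not wrapped in print()
df.info(memory_usage='deep')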
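Indexing, mentioned above as a way to speed up data retrieval, is not shown in the example. The following sketch illustrates the idea on a small, hypothetical frame: a boolean mask compares every row, while a sorted index lets `.loc` look rows up by label. Any speed benefit appears only on large frames with repeated lookups and should be measured rather than assumed.

```python
import pandas as pd

# Small hypothetical frame; the technique matters on far larger data
df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi', 'Bhagath', 'Monika'],
    'Age': [25, 30, 35, 28, 25, 40],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur', 'Bangalore', 'Hyderabad'],
})

# Option 1: boolean mask, which compares every row against the target value
mask_result = df[df['City'] == 'Hyderabad']

# Option 2: make City the index and sort it so label lookups can use the ordering
indexed = df.set_index('City').sort_index()

# A list of labels always returns a DataFrame, even when only one row matches
loc_result = indexed.loc[['Hyderabad']]

print(mask_result)
print(loc_result)
```

Both options return the same rows; the indexed version moves `City` out of the columns and into the index, which is worth doing when the same column is used for lookups many times.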
