# Data Analysis and Visualization with Python

This notebook demonstrates basic data analysis and visualization in Python using pandas, NumPy, matplotlib, and seaborn. We'll generate a sample dataset, then explore, analyze, and visualize it.

## 1. Import Required Libraries

First, we need to import the necessary libraries for data manipulation and visualization:

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configure Matplotlib for better visualization in notebooks
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

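
If the import cell fails with a `ModuleNotFoundError`, the packages are most likely not installed in the environment the kernel is running in. The following is a minimal sketch for checking that before running the rest of the notebook; the helper name `missing_packages` and the `REQUIRED` list are illustrative, not part of the original notebook.

```python
import importlib.util

# Packages imported by the first cell of this notebook
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # In a notebook, `%pip install <name>` installs into the running kernel's environment
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Anything reported as missing needs to be installed before the import cell is re-run.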

## 2. Generate Sample Data

Let's create some sample data to work with. We'll generate one year of daily sales and transaction counts for five regions:

In [None]:
# Set seed for reproducibility
np.random.seed(42)

# Create a one-year daily date range (2024-08-01 through 2025-07-31)
date_range = pd.date_range(start='2024-08-01', end='2025-07-31', freq='D')

# Define regions
regions = ['North', 'South', 'East', 'West', 'Central']

# Collect rows in a list of dicts; the DataFrame is built from it afterwards
data = []

# Generate data for each region
for region in regions:
    # Base sales amount varies by region
    base_sales = np.random.randint(100, 200)
    
    # Generate daily sales with seasonal patterns
    for date in date_range:
        # Add seasonality (peaks in July, lowest in January)
        month = date.month
        seasonal_factor = 1.0 + 0.3 * np.sin((month - 4) * np.pi / 6)
        
        # Add weekday effect (weekends have higher sales)
        weekday = date.weekday()
        weekday_factor = 1.2 if weekday >= 5 else 1.0  # Weekend boost
        
        # Add some randomness
        noise = np.random.normal(1, 0.2)
        
        # Calculate sales
        sales = base_sales * seasonal_factor * weekday_factor * noise
        
        # Add to data
        data.append({
            'date': date,
            'region': region,
            'sales': round(sales, 2),
            'transactions': int(sales / np.random.uniform(5, 15))
        })

# Create DataFrame
df = pd.DataFrame(data)

# Display the first few rows
df.head()
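
The seasonal term in the generator is a sine wave with a 12-month period, and the number subtracted from the month inside the sine decides where the peak falls. As a quick check, the sketch below tabulates the factor for two offsets; `shift` and the two helper functions are hypothetical names introduced here. A shift of 1 puts the peak in April, while a shift of 4 puts it in July.

```python
import numpy as np


def seasonal_factor(month, shift):
    """Seasonal multiplier 1.0 + 0.3 * sin((month - shift) * pi / 6)."""
    return 1.0 + 0.3 * np.sin((month - shift) * np.pi / 6)


def peak_month(shift):
    """Calendar month (1-12) at which the factor is largest."""
    months = np.arange(1, 13)
    return int(months[np.argmax(seasonal_factor(months, shift))])


for shift in (1, 4):
    factors = seasonal_factor(np.arange(1, 13), shift)
    print(f"shift={shift}: peak month={peak_month(shift)}, "
          f"range={factors.min():.2f} to {factors.max():.2f}")
```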

## 3. Data Exploration

Let's explore our dataset to understand its structure and contents:

In [None]:
# Check the shape of our dataset
print(f"Dataset shape: {df.shape}")

# Check data types
print("\nData Types:")
print(df.dtypes)

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Get statistical summary
print("\nStatistical Summary:")
print(df.describe())

# Check unique regions
print("\nUnique Regions:", df['region'].unique())
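
Beyond shape, dtypes, and missing values, a few dataset-specific checks can be worth running. The sketch below assumes a frame with the same `date`, `region`, `sales`, and `transactions` columns; the function name `check_sales_frame` and the tiny sample frame are made up for illustration.

```python
import pandas as pd


def check_sales_frame(frame):
    """Return basic integrity checks for a date/region/sales/transactions frame."""
    return {
        # Each (date, region) pair is expected to appear once
        "duplicate_date_region_rows": int(frame.duplicated(subset=["date", "region"]).sum()),
        # Sales at or below zero would be suspicious for this kind of data
        "non_positive_sales": int((frame["sales"] <= 0).sum()),
        # Zero transactions would break a later sales / transactions ratio
        "zero_transactions": int((frame["transactions"] == 0).sum()),
        "rows_per_region": frame["region"].value_counts().to_dict(),
    }


# Tiny illustrative frame (made-up values, not the notebook's data)
sample = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-01", "2024-08-01", "2024-08-02"]),
    "region": ["North", "South", "North"],
    "sales": [120.5, 98.0, 0.0],
    "transactions": [12, 9, 0],
})

report = check_sales_frame(sample)
print(report)
```

In the notebook itself, the same function could be called as `check_sales_frame(df)`.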

## 4. Data Visualization

Now that we understand our data, let's create some visualizations to better analyze trends and patterns:

In [None]:
# Time Series Analysis - Sales over time by region
plt.figure(figsize=(14, 8))

# Group by date and region to get daily sales
daily_sales = df.groupby(['date', 'region'])['sales'].sum().reset_index()

# Plot time series for each region
for region in regions:
    region_data = daily_sales[daily_sales['region'] == region]
    plt.plot(region_data['date'], region_data['sales'], label=region)

plt.title('Daily Sales by Region', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Sales Amount', fontsize=12)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
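
Daily series with this much noise can be hard to read when five regions overlap. One common option is to plot a rolling mean instead of the raw values. The sketch below is one way to compute it per region; the 7-day window, the helper name `add_rolling_sales`, and the sample values are assumptions for illustration.

```python
import pandas as pd


def add_rolling_sales(frame, window=7):
    """Add a per-region rolling mean of sales as a 'sales_rolling' column."""
    out = frame.sort_values(["region", "date"]).copy()
    out["sales_rolling"] = (
        out.groupby("region")["sales"]
           # min_periods=1 keeps the first rows instead of leaving them NaN
           .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return out


# Tiny illustrative frame (made-up values)
sample = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-01", "2024-08-02", "2024-08-03",
                            "2024-08-01", "2024-08-02"]),
    "region": ["North", "North", "North", "South", "South"],
    "sales": [10.0, 20.0, 30.0, 4.0, 8.0],
})

smoothed = add_rolling_sales(sample, window=2)
print(smoothed)
```

Plotting `sales_rolling` in place of `sales` in the loop above would give smoother lines at the cost of hiding day-to-day variation.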

In [None]:
# Monthly analysis
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.strftime('%b')

# Get monthly sales by region
monthly_sales = df.groupby(['month', 'month_name', 'region'])['sales'].sum().reset_index()

# Create a pivot table for easier plotting
pivot_data = monthly_sales.pivot(index=['month', 'month_name'], columns='region', values='sales')
pivot_data = pivot_data.sort_index(level=0)

# Plot monthly data
plt.figure(figsize=(14, 8))
pivot_data.plot(kind='bar', ax=plt.gca())
plt.title('Monthly Sales by Region', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Total Sales', fontsize=12)
plt.xticks(ticks=range(len(pivot_data)), labels=pivot_data.index.get_level_values('month_name'), rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.legend(title='Region')
plt.tight_layout()
plt.show()
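
Grouping by calendar month number orders the bars January through December, although the data runs from August 2024 to July 2025. If chronological order matters, one alternative is to group by year-month periods. This is a sketch under that assumption; `monthly_totals_chronological` and the sample values are illustrative.

```python
import pandas as pd


def monthly_totals_chronological(frame):
    """Total sales per year-month and region, ordered chronologically."""
    periods = frame["date"].dt.to_period("M").rename("period")
    return frame.groupby([periods, "region"])["sales"].sum().unstack("region")


# Tiny illustrative frame (made-up values), deliberately out of order
sample = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-15", "2024-08-01", "2024-08-20",
                            "2024-08-05", "2025-01-03"]),
    "region": ["North", "North", "North", "South", "South"],
    "sales": [30.0, 10.0, 5.0, 7.0, 2.0],
})

totals = monthly_totals_chronological(sample)
print(totals)
```

The result has one row per year-month and one column per region, so `totals.plot(kind='bar')` would draw the months in the order they occurred.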

In [None]:
# Region performance comparison
region_summary = df.groupby('region').agg({
    'sales': ['sum', 'mean', 'std'],
    'transactions': ['sum', 'mean']
}).reset_index()

# Flatten the multi-index columns
region_summary.columns = ['_'.join(col).strip('_') for col in region_summary.columns.values]

# Plot total sales by region
plt.figure(figsize=(12, 6))
sns.barplot(x='region', y='sales_sum', data=region_summary, palette='viridis')
plt.title('Total Sales by Region', fontsize=16)
plt.xlabel('Region', fontsize=12)
plt.ylabel('Total Sales', fontsize=12)
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on top of bars
for i, v in enumerate(region_summary['sales_sum']):
    plt.text(i, v + 1000, f'{int(v):,}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()
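
The dict-style `agg` call above produces multi-level columns that then have to be flattened. pandas also supports named aggregation, which yields flat column names directly. The sketch below shows the equivalent summary; the function name `summarize_regions` and the sample values are illustrative.

```python
import pandas as pd


def summarize_regions(frame):
    """Per-region summary with flat column names via named aggregation."""
    return (
        frame.groupby("region")
             .agg(
                 sales_sum=("sales", "sum"),
                 sales_mean=("sales", "mean"),
                 sales_std=("sales", "std"),
                 transactions_sum=("transactions", "sum"),
                 transactions_mean=("transactions", "mean"),
             )
             .reset_index()
    )


# Tiny illustrative frame (made-up values)
sample = pd.DataFrame({
    "region": ["North", "North", "South"],
    "sales": [100.0, 200.0, 50.0],
    "transactions": [10, 20, 5],
})

summary = summarize_regions(sample)
print(summary)
```

The output column names match the ones produced by the flattening step, so the plotting code would work unchanged with either approach.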

## 5. Data Analysis

Let's derive a few features, then look at correlations between variables and at the weekday/weekend split:

In [None]:
# Add some derived features
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
# Guard against division by zero on days with zero transactions
df['avg_transaction_value'] = df['sales'] / df['transactions'].replace(0, np.nan)

# Let's analyze correlation between different variables
correlation_vars = df[['sales', 'transactions', 'day_of_week', 'is_weekend', 'month', 'avg_transaction_value']]
correlation = correlation_vars.corr()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')
plt.title('Correlation Matrix', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Weekend vs Weekday Analysis
weekday_data = df.groupby(['is_weekend', 'region']).agg({
    'sales': ['mean', 'sum'],
    'transactions': ['mean', 'sum']
}).reset_index()

# Flatten multi-level columns
weekday_data.columns = ['_'.join(col).strip('_') for col in weekday_data.columns.values]

# Convert is_weekend to categorical for better labels
weekday_data['day_type'] = weekday_data['is_weekend'].map({0: 'Weekday', 1: 'Weekend'})

# Plot average sales by day type and region
plt.figure(figsize=(14, 7))
sns.barplot(x='region', y='sales_mean', hue='day_type', data=weekday_data, palette='Set2')
plt.title('Average Daily Sales: Weekday vs Weekend', fontsize=16)
plt.xlabel('Region', fontsize=12)
plt.ylabel('Average Daily Sales', fontsize=12)
plt.grid(True, alpha=0.3, axis='y')
plt.legend(title='Day Type')
plt.tight_layout()
plt.show()
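
The bar chart shows the weekday/weekend gap visually; a single ratio per region makes it easier to compare. Since the generator multiplies weekend sales by 1.2, the ratio on the notebook's data should land roughly near 1.2, up to noise. The sketch below computes it; `weekend_uplift` and the sample values are illustrative.

```python
import pandas as pd


def weekend_uplift(frame):
    """Ratio of mean weekend sales to mean weekday sales, per region."""
    is_weekend = frame["date"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
    means = (
        frame.assign(day_type=is_weekend.map({True: "Weekend", False: "Weekday"}))
             .groupby(["region", "day_type"])["sales"].mean()
             .unstack("day_type")
    )
    return (means["Weekend"] / means["Weekday"]).rename("weekend_uplift")


# Tiny illustrative frame (made-up values); 2024-08-03 and 2024-08-04 fall on a weekend
sample = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-01", "2024-08-02", "2024-08-03", "2024-08-04",
                            "2024-08-01", "2024-08-03"]),
    "region": ["North", "North", "North", "North", "South", "South"],
    "sales": [100.0, 100.0, 120.0, 120.0, 200.0, 250.0],
})

uplift = weekend_uplift(sample)
print(uplift)
```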

## 6. Conclusion

In this notebook, we've demonstrated how to:

1. Generate and manipulate sample data using NumPy and pandas
2. Explore data characteristics and structure
3. Create various visualizations using matplotlib and seaborn
4. Perform basic data analysis to extract insights

This notebook can serve as a starting point for more complex data analysis tasks. You can extend it by:

- Adding more advanced statistical analysis
- Building predictive models
- Creating interactive visualizations
- Working with real-world datasets from various sources