# Assignment 1:
### The automated_stat_analyzer Function

Scenario: A retail company needs a utility to quickly summarize sales data. Students must create a function that identifies the "Central Tendency" and "Dispersion" of any numerical column.

### Requirements:

* Accept a Pandas DataFrame and a column name.

* Calculate the Mean, Median, and Standard Deviation.

* Identify if the data is "Skewed" by comparing the Mean and Median.


* Bonus: If the column is categorical, return the Mode instead.

### Your Data

In [2]:
import pandas as pd
import numpy as np

# Create a synthetic Company Sales Dataset
data = {
    'Transaction_ID': range(1, 11),
    'Product_Category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home', 
                         'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'],
    'Sales_Amount': [150, 200, 155, 300, 210, 180, 205, 1000, 190, 160], # 1000 is an Outlier
    'Customer_Age': [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],    # Contains Nulls (NaN)
    'Rating': [5, 4, 3, 5, 2, 4, 5, 2, 4, 3]
}

df_test = pd.DataFrame(data)

# Save to CSV for students to practice loading files
df_test.to_csv('company_sales_test.csv', index=False)
print("Test dataset created successfully!")

Test dataset created successfully!


In [3]:
df_test.head()

   Transaction_ID Product_Category  Sales_Amount  Customer_Age  Rating
0               1      Electronics           150          25.0       5
1               2             Home           200          34.0       4
2               3      Electronics           155           NaN       3
3               4           Sports           300          45.0       5
4               5             Home           210          23.0       2


In [4]:
import pandas as pd

def automated_stat_analyzer(df, column_name):
    """
    Company Task: Provide a summary report of a specific data variable.
    """
    # 1. Check if the column is numerical or categorical
    # We use 'is_numeric_dtype' to distinguish between numbers and text
    if pd.api.types.is_numeric_dtype(df[column_name]):
        
        # 2. For numerical: Calculate Mean, Median, and Standard Deviation
        stats = {
            "Mean": df[column_name].mean(),
            "Median": df[column_name].median(),
            "Standard Deviation": df[column_name].std(),
            "Type": "Numerical"
        }
    else:
        # 3. For categorical: Calculate the Mode
        # .mode() returns a Series, so we take the first element [0]
        stats = {
            "Mode": df[column_name].mode()[0],
            "Type": "Categorical"
        }
        
    # 4. Return a dictionary with these statistical measures
    return stats

# Test Case: Using 'Sales_Amount' as requested in the TODO
print("Testing with Sales_Amount:")
report = automated_stat_analyzer(df_test, 'Sales_Amount')
print(report)

# Test Case: Using 'Product_Category' to check categorical logic
print("\nTesting with Product_Category:")
print(automated_stat_analyzer(df_test, 'Product_Category'))

Testing with Sales_Amount:
{'Mean': np.float64(275.0), 'Median': np.float64(195.0), 'Standard Deviation': np.float64(258.30645021412494), 'Type': 'Numerical'}

Testing with Product_Category:
{'Mode': 'Electronics', 'Type': 'Categorical'}
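The requirements also ask whether the data is "Skewed", which the report above does not include. The following is a minimal sketch of one way to make that comparison, not part of the original solution: `skew_direction` is a hypothetical helper, and `tolerance` is an assumed threshold (the gap between Mean and Median, measured in standard deviations, below which the data is treated as roughly symmetric).

```python
import pandas as pd

def skew_direction(series, tolerance=0.05):
    """Hypothetical helper: label skew by comparing the Mean with the Median.

    `tolerance` is an assumed cut-off, expressed as a fraction of the
    standard deviation; it is not taken from the assignment text.
    """
    mean, median, std = series.mean(), series.median(), series.std()

    # With no spread (or too few values) the comparison is not meaningful
    if pd.isna(std) or std == 0:
        return "Undetermined"

    gap = (mean - median) / std
    if gap > tolerance:
        return "Right-skewed (Mean > Median)"
    if gap < -tolerance:
        return "Left-skewed (Mean < Median)"
    return "Approximately symmetric"

# Same Sales_Amount values as df_test (Mean 275.0, Median 195.0)
sales = pd.Series([150, 200, 155, 300, 210, 180, 205, 1000, 190, 160])
print(skew_direction(sales))
```

For `Sales_Amount` the Mean sits well above the Median because of the 1000 outlier, so the helper reports a right skew. The result could be added to the numerical branch of the report under a key such as "Skew".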


# Assignment 2:
### The null_handling_strategy Function

Scenario: Incoming user data often has missing values. Students must implement a flexible strategy to handle these "Null Values" to prepare data for Machine Learning.

### Requirements:

* Check for null values in the DataFrame.

* Apply a strategy based on parameters: "drop_rows", "fill_mean", or "fill_median".

* Ensure the function only fills numerical columns when using mean or median.
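The first requirement, checking for null values, can be sketched as below. This is only an illustration on a small made-up frame; `df_demo`, `null_counts`, and `has_nulls` are assumed names, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for this sketch
df_demo = pd.DataFrame({
    "Customer_Age": [25, 34, np.nan, 45],
    "Rating": [5, 4, 3, 5],
})

# Count missing values per column
null_counts = df_demo.isnull().sum()
print(null_counts)

# Single True/False answer: does the frame contain any nulls at all?
has_nulls = bool(df_demo.isnull().values.any())
print("Contains nulls:", has_nulls)
```

Reporting the per-column counts before cleaning makes it easier to see which columns a fill strategy will actually change.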

In [5]:
def null_handling_strategy(df, strategy="fill_mean"):
    """
    Company Task: Clean a dataset by resolving missing (NaN) values.
    """
    # TODO: Implement using .isnull(), .dropna(), or .fillna(). You can use Customer_Age for your test case.
    pass

In [6]:
import pandas as pd

def null_handling_strategy(df, strategy="fill_mean"):
    """
    Cleans a dataset by resolving missing (NaN) values using specific strategies.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The input dataset containing potential null values.
    strategy : str, default "fill_mean"
        The strategy to handle missing data. Options:
        - "drop_rows": Removes any row with a missing value.
        - "fill_mean": Fills missing values in numerical columns with the mean.
        - "fill_median": Fills missing values in numerical columns with the median.
        
    Returns:
    --------
    pandas.DataFrame
        A cleaned copy of the original DataFrame.
    """
    
    # 1. Create a copy of the DataFrame to keep the original data safe
    df_cleaned = df.copy()

    # 2. Strategy: Drop rows with any missing values
    if strategy == "drop_rows":
        df_cleaned = df_cleaned.dropna()
        
    # 3. Strategy: Fill missing values with Mean or Median
    elif strategy in ["fill_mean", "fill_median"]:
        # Iterate through columns to find numerical ones
        for col in df_cleaned.columns:
            # Check if the column is numerical (Required)
            if pd.api.types.is_numeric_dtype(df_cleaned[col]):
                if strategy == "fill_mean":
                    fill_value = df_cleaned[col].mean()
                else:
                    fill_value = df_cleaned[col].median()
                
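                # Fill NaNs in the current column
                df_cleaned[col] = df_cleaned[col].fillna(fill_value)

    # 4. Unknown strategy: fail loudly instead of returning the data unchanged
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    return df_cleaned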

# --- Test Case Execution ---
# Testing the function on 'Customer_Age' (which has NaNs in your df_test)
print("Handling Nulls using Mean Strategy:")
cleaned_df = null_handling_strategy(df_test, strategy="fill_mean")

# Displaying the result for verification
print(cleaned_df[['Customer_Age', 'Sales_Amount']].head())

Handling Nulls using Mean Strategy:
   Customer_Age  Sales_Amount
0        25.000           150
1        34.000           200
2        33.125           155
3        45.000           300
4        23.000           210
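
The run above only exercises "fill_mean". As a rough sketch of what the other two strategies would produce for `Customer_Age`, the snippet below applies the underlying pandas calls directly to the same ten values rather than calling `null_handling_strategy`. The name `ages` is illustrative.

```python
import numpy as np
import pandas as pd

# Same Customer_Age values as df_test; `ages` is an illustrative name
ages = pd.Series([25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],
                 name="Customer_Age")

dropped = ages.dropna()                      # "drop_rows": missing entries removed
filled_mean = ages.fillna(ages.mean())       # "fill_mean": NaN replaced by the mean
filled_median = ages.fillna(ages.median())   # "fill_median": NaN replaced by the median

print("Rows kept after dropping:", len(dropped))
print("Index 2 with mean fill:  ", filled_mean[2])
print("Index 2 with median fill:", filled_median[2])
```

Dropping keeps 8 of the 10 values, while the two fill strategies keep all 10 and differ only in the replacement value (33.125 for the mean, 32.5 for the median).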
