# Assignment 1:
### The automated_stat_analyzer Function
- Scenario: A retail company needs a utility to quickly summarize sales data. Students must create a function that identifies the "Central Tendency" and "Dispersion" of any numerical column.
### Requirements:

* Accept a Pandas DataFrame and a column name.

* Calculate the Mean, Median, and Standard Deviation.

* Identify whether the data is "Skewed" by comparing the Mean and Median.


* Bonus: If the column is categorical, return the Mode instead.
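
The mean-versus-median comparison in the requirements can be illustrated with a minimal sketch. It uses the `Sales_Amount` values from the test dataset in this notebook; the names `sales` and `trimmed`, and the idea of dropping the outlier for comparison, are assumptions added here for illustration and are not part of the assignment.

```python
import pandas as pd

# Sales_Amount values from the test dataset; 1000 is the outlier
sales = pd.Series([150, 200, 155, 300, 210, 180, 205, 1000, 190, 160])

mean_with = sales.mean()
median_with = sales.median()
print(mean_with, median_with)  # 275.0 195.0

# Illustrative comparison: recompute both measures without the outlier
trimmed = sales[sales < 1000]
mean_without = trimmed.mean()
median_without = trimmed.median()
print(mean_without, median_without)

# Series.skew() gives the sample skewness as a cross-check;
# a positive value is consistent with Mean > Median (a long right tail)
print(sales.skew())
```

The outlier pulls the mean well above the median, while the median barely moves once the outlier is removed. That gap is what the "Skewed" check in the requirements relies on.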

### Your Data

In [2]:
import pandas as pd
import numpy as np

# Create a synthetic Company Sales Dataset
data = {
    'Transaction_ID': range(1, 11),
    'Product_Category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home', 
                         'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'],
    'Sales_Amount': [150, 200, 155, 300, 210, 180, 205, 1000, 190, 160], # 1000 is an Outlier
    'Customer_Age': [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],    # Contains Nulls (NaN)
    'Rating': [5, 4, 3, 5, 2, 4, 5, 2, 4, 3]
}

df_test = pd.DataFrame(data)

# Save to CSV for students to practice loading files
df_test.to_csv('company_sales_test.csv', index=False)
print("Test dataset created successfully!")

Test dataset created successfully!


In [3]:
df_test.head()

   Transaction_ID Product_Category  Sales_Amount  Customer_Age  Rating
0               1      Electronics           150          25.0       5
1               2             Home           200          34.0       4
2               3      Electronics           155           NaN       3
3               4           Sports           300          45.0       5
4               5             Home           210          23.0       2


In [4]:
import pandas as pd

def automated_stat_analyzer(df, column_name):
    """
    Company Task: Provide a summary report of a specific data variable.
    
    Instructions:
    1. Check if the column is numerical or categorical.
    2. For numerical: Calculate Mean, Median, and Standard Deviation.
    3. For categorical: Calculate the Mode.
    4. Return a dictionary with these statistical measures.

    :param df: pandas DataFrame containing the dataset
    :type df: pandas.DataFrame
    :param column_name: The name of the column to analyze
    :type column_name: str

    :return: A dictionary with the calculated statistics    
    :rtype: dict
    """

    column = df[column_name]

    # Case 1: Numerical column
    if pd.api.types.is_numeric_dtype(column):

        # Handle missing values: drop them rather than impute,
        # since filling with the median would shrink the standard deviation
        column_cleaned = column.dropna()

        mean_value = column_cleaned.mean()
        median_value = column_cleaned.median()
        std_value = column_cleaned.std()

        # Determine skewness (logical comparison, not mathematical skew)
        if mean_value > median_value:
            skewness = "Data is skewed (Mean > Median)"
        elif mean_value < median_value:
            skewness = "Data is skewed (Mean < Median)"
        else:
            skewness = "Data is approximately balanced"

        result = {
            "Mean": mean_value,
            "Median": median_value,
            "Standard Deviation": std_value,
            "Skewness": skewness
        }

    # Case 2: Categorical column (object, string, or category dtype)
    elif (
        pd.api.types.is_object_dtype(column)
        or pd.api.types.is_string_dtype(column)
        or isinstance(column.dtype, pd.CategoricalDtype)
    ):

        column_cleaned = column.dropna()

        result = {
            "Mode": column_cleaned.mode().iloc[0]
        }

    else:
        result = {
            "Message": f"Unsupported data type: {column.dtype}"
        }

    return result


test = automated_stat_analyzer(df_test, "Sales_Amount")
print(test)
test1 = automated_stat_analyzer(df_test, "Product_Category")
print(test1)


{'Mean': np.float64(275.0), 'Median': np.float64(195.0), 'Standard Deviation': np.float64(258.30645021412494), 'Skewness': 'Data is skewed (Mean > Median)'}
{'Mode': 'Electronics'}




# Assignment 2:
### The null_handling_strategy Function

- Scenario: Incoming user data often has missing values. Students must implement a flexible strategy to handle these "Null Values" to prepare data for Machine Learning.
### Requirements:

* Check for null values in the DataFrame.

* Apply a strategy based on parameters: "drop_rows", "fill_mean", or "fill_median".

* Ensure the function only fills numerical columns when using mean or median.
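
The first requirement, checking for null values, can be sketched as follows. This is a minimal illustration that rebuilds two columns of the test dataset; the names `null_counts` and `has_nulls` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Two columns from the test dataset: one with gaps, one complete
df = pd.DataFrame({
    "Customer_Age": [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],
    "Rating": [5, 4, 3, 5, 2, 4, 5, 2, 4, 3],
})

# Count missing values per column
null_counts = df.isnull().sum()
print(null_counts)

# Single flag: does the DataFrame contain any nulls at all?
has_nulls = bool(null_counts.sum() > 0)
print(has_nulls)  # True
```

A per-column count shows where a strategy is needed, and the single flag is enough to decide whether any cleaning has to run.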

In [5]:
def null_handling_strategy(df, strategy="fill_mean"):
    """
    Company Task: Clean a dataset by resolving missing (NaN) values.

    :param df: The pandas DataFrame to process.
    :type df: pandas.DataFrame
    :param strategy: The strategy to use for handling null values
        ("drop_rows", "fill_mean", "fill_median", "fill_mode").
    :type strategy: str

    :return: The cleaned DataFrame.
    :rtype: pandas.DataFrame

    """
    valid_strategies = ("drop_rows", "fill_mean", "fill_median", "fill_mode")
    if strategy not in valid_strategies:
        raise ValueError("Invalid strategy. Choose from 'drop_rows', 'fill_mean', 'fill_median', 'fill_mode'.")

    # Check for null values first; nothing to clean if there are none
    if not df.isnull().values.any():
        return df.copy()

    if strategy == "drop_rows":
        return df.dropna()
    elif strategy == "fill_mean":
        # numeric_only=True computes fill values for numerical columns only
        return df.fillna(df.mean(numeric_only=True))
    elif strategy == "fill_median":
        return df.fillna(df.median(numeric_only=True))
    else:
        # "fill_mode": the most frequent value per column, any dtype
        return df.fillna(df.mode().iloc[0])
    


cleaned_df = null_handling_strategy(df_test, strategy="fill_mean")
print(cleaned_df)

   Transaction_ID Product_Category  Sales_Amount  Customer_Age  Rating
0               1      Electronics           150        25.000       5
1               2             Home           200        34.000       4
2               3      Electronics           155        33.125       3
3               4           Sports           300        45.000       5
4               5             Home           210        23.000       2
5               6      Electronics           180        31.000       4
6               7             Home           205        29.000       5
7               8           Sports          1000        33.125       2
8               9      Electronics           190        38.000       4
9              10      Electronics           160        40.000       3
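
To confirm that a mean fill touches only numerical columns, the following minimal sketch uses a small hypothetical frame (not the test dataset) with a gap in both a numerical and a text column. The frame and the name `filled` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in a numerical and one in a text column
df = pd.DataFrame({
    "Customer_Age": [25.0, np.nan, 35.0],
    "Product_Category": ["Home", None, "Sports"],
})

# df.mean(numeric_only=True) holds a fill value for Customer_Age only,
# so fillna leaves the text column untouched
filled = df.fillna(df.mean(numeric_only=True))

print(filled["Customer_Age"].tolist())            # [25.0, 30.0, 35.0]
print(filled["Product_Category"].isnull().sum())  # 1
```

The text column keeps its missing value, so a separate strategy (for example a mode fill) would be needed if categorical gaps must also be resolved.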
