## Assignment 1:
### The automated_stat_analyzer Function
#### Scenario: A retail company needs a utility to quickly summarize sales data. Students must create a function that identifies the "Central Tendency" and "Dispersion" of any numerical column.
### Requirements:

* Accept a Pandas DataFrame and a column name.

* Calculate the Mean, Median, and Standard Deviation.

* Identify if the data is "Skewed" by comparing the Mean and Median.


* Bonus: If the column is categorical, return the Mode instead.
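
The mean-versus-median comparison named in the requirements can be sketched as below. This is a minimal illustration, not the required solution: it uses the ten `Sales_Amount` values from this assignment's test dataset, and the variable names (`sales`, `label`) are chosen here for illustration. The comparison is only a heuristic, so the sketch also cross-checks it against the moment-based `Series.skew()` that pandas provides.

```python
import pandas as pd

# Sales_Amount values from this assignment's test dataset (1000 is the outlier)
sales = pd.Series([150, 200, 155, 300, 210, 180, 205, 1000, 190, 160])

mean = sales.mean()      # pulled upward by the outlier
median = sales.median()  # middle of the sorted values, robust to the outlier

# Heuristic: a mean above the median suggests a right (positive) skew
if mean > median:
    label = "Right Skewed"
elif mean < median:
    label = "Left Skewed"
else:
    label = "Symmetric"

print(mean, median, label)  # 275.0 195.0 Right Skewed

# Cross-check: the sample skewness is positive for a right-skewed column
print(sales.skew() > 0)  # True
```

The single outlier moves the mean from the 150 to 300 range up to 275, while the median stays at 195, which is why the two measures disagree.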

### Your Data

In [1]:
import pandas as pd
import numpy as np

# Create a synthetic Company Sales Dataset
data = {
    'Transaction_ID': range(1, 11),
    'Product_Category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home', 
                        'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'],
    'Sales_Amount': [150, 200, 155, 300, 210, 180, 205, 1000, 190, 160], # 1000 is an Outlier
    'Customer_Age': [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],    # Contains Nulls (NaN)
    'Rating': [5, 4, 3, 5, 2, 4, 5, 2, 4, 3]
}

df_test = pd.DataFrame(data)

# Save to CSV for students to practice loading files
df_test.to_csv('company_sales_test.csv', index=False)
print("Test dataset created successfully!")

Test dataset created successfully!


In [3]:
df_test.head()

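| | Transaction_ID | Product_Category | Sales_Amount | Customer_Age | Rating |
|---|---|---|---|---|---|
| 0 | 1 | Electronics | 150 | 25.0 | 5 |
| 1 | 2 | Home | 200 | 34.0 | 4 |
| 2 | 3 | Electronics | 155 | NaN | 3 |
| 3 | 4 | Sports | 300 | 45.0 | 5 |
| 4 | 5 | Home | 210 | 23.0 | 2 |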


In [4]:
def automated_stat_analyzer(df, column_name):
    """
    Company Task: Provide a summary report of a specific data variable.

    Instructions:
    1. Check if the column is numerical or categorical.
    2. For numerical: Calculate Mean, Median, and Standard Deviation.
    3. For categorical: Calculate the Mode.
    4. Return a dictionary with these statistical measures.
    """
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' not found in DataFrame")

    # Drop missing values so they do not affect the statistics
    column = df[column_name].dropna()
    if column.empty:
        raise ValueError(f"Column '{column_name}' has no non-null values")

    if pd.api.types.is_numeric_dtype(column):
        mean = column.mean()
        median = column.median()
        std_dev = column.std()  # sample standard deviation (ddof=1)

        # Heuristic: compare mean and median to label the skew direction
        if mean > median:
            skewness = "Right Skewed"
        elif mean < median:
            skewness = "Left Skewed"
        else:
            skewness = "Symmetric"

        return {
            "type": "Numerical",
            "mean": float(mean),
            "median": float(median),
            "standard_deviation": float(std_dev),
            "skewness": skewness
        }

    else:
        # mode() returns every tied value, sorted; report the first one
        mode = column.mode()

        return {
            "type": "Categorical",
            "mode": mode.iloc[0]
        }

# automated_stat_analyzer(df_test, 'Sales_Amount')
automated_stat_analyzer(df_test, 'Product_Category')

{'type': 'Categorical', 'mode': 'Electronics'}
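
One detail of the categorical branch is worth illustrating: `Series.mode()` returns every value tied for the highest count, in sorted order, so taking `.iloc[0]` reports only the first of them. The sketch below uses a hypothetical tied column (the values are made up for illustration and are not part of the test dataset) and also shows why an all-null column needs a guard before indexing.

```python
import pandas as pd

# Hypothetical column in which two categories are tied for most frequent
tied = pd.Series(["Sports", "Home", "Sports", "Home", "Electronics"])

modes = tied.mode()
print(modes.tolist())  # ['Home', 'Sports']

# .iloc[0] keeps only the first tied value in sorted order
first_mode = modes.iloc[0]
print(first_mode)  # Home

# A column with only missing values has no mode; indexing it would raise IndexError
all_null = pd.Series([None, None], dtype=object).dropna()
print(all_null.mode().empty)  # True
```

Whether to return the first mode or the full list of tied values is a design choice; the assignment text does not specify one.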

## Assignment 2:
### The null_handling_strategy Function


#### Scenario: Incoming user data often has missing values. Students must implement a flexible strategy to handle these "Null Values" to prepare data for Machine Learning.
### Requirements:

* Check for null values in the DataFrame.

* Apply a strategy based on parameters: "drop_rows", "fill_mean", or "fill_median".

* Ensure the function only fills numerical columns when using mean or median.
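
The first and third requirements can be sketched with standard pandas calls. This is a minimal illustration on a small hypothetical DataFrame (the three rows are made up for this example): `isna().sum()` counts nulls per column, and `select_dtypes(include="number")` picks out the columns that are safe to fill with a mean or median.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with nulls in a numerical and a categorical column
df = pd.DataFrame({
    "Sales_Amount": [150, 200, 155],
    "Customer_Age": [25, np.nan, np.nan],
    "Product_Category": ["Home", None, "Sports"],
})

# Number of null values in each column
null_counts = df.isna().sum()
print(null_counts["Customer_Age"])      # 2
print(null_counts["Product_Category"])  # 1

# True if any cell in the DataFrame is null
has_nulls = bool(df.isna().any().any())

# Only these columns should be filled with a mean or median
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Sales_Amount', 'Customer_Age']
```

Because `Product_Category` is not numerical, a mean or median fill leaves its null in place; only "drop_rows" would remove it.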

In [44]:
def null_handling_strategy(df, strategy="fill_mean"):
    """Return a copy of df with null values handled by the chosen strategy."""
    # Work on a copy so the caller's DataFrame is not modified
    df_cleaned = df.copy()

    if strategy == "drop_rows":
        # Remove every row that contains at least one null value
        df_cleaned = df_cleaned.dropna()

    elif strategy == "fill_mean":
        # Fill numerical columns only; categorical columns are left untouched
        numeric_cols = df_cleaned.select_dtypes(include='number')
        mean = numeric_cols.mean()
        df_cleaned[numeric_cols.columns] = df_cleaned[numeric_cols.columns].fillna(mean)

    elif strategy == "fill_median":
        numeric_cols = df_cleaned.select_dtypes(include='number')
        median = numeric_cols.median()
        df_cleaned[numeric_cols.columns] = df_cleaned[numeric_cols.columns].fillna(median)

    else:
        raise ValueError("Invalid strategy. Use 'drop_rows', 'fill_mean', or 'fill_median'.")

    return df_cleaned


In [48]:
null_handling_strategy(df_test, strategy="fill_median")

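| | Transaction_ID | Product_Category | Sales_Amount | Customer_Age | Rating |
|---|---|---|---|---|---|
| 0 | 1 | Electronics | 150 | 25.0 | 5 |
| 1 | 2 | Home | 200 | 34.0 | 4 |
| 2 | 3 | Electronics | 155 | 32.5 | 3 |
| 3 | 4 | Sports | 300 | 45.0 | 5 |
| 4 | 5 | Home | 210 | 23.0 | 2 |
| 5 | 6 | Electronics | 180 | 31.0 | 4 |
| 6 | 7 | Home | 205 | 29.0 | 5 |
| 7 | 8 | Sports | 1000 | 32.5 | 2 |
| 8 | 9 | Electronics | 190 | 38.0 | 4 |
| 9 | 10 | Electronics | 160 | 40.0 | 3 |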
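
To see how the three strategies differ on this data, the sketch below applies each one directly to the `Customer_Age` values from the test dataset. It is a self-contained illustration using plain pandas calls rather than `null_handling_strategy` itself, and the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Customer_Age values from the test dataset (two entries are missing)
ages = pd.Series([25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],
                 name="Customer_Age")

# mean() and median() skip NaN by default
mean_age = ages.mean()
median_age = ages.median()
print(mean_age, median_age)  # 33.125 32.5

filled_mean = ages.fillna(mean_age)      # both gaps become 33.125
filled_median = ages.fillna(median_age)  # both gaps become 32.5
dropped = ages.dropna()                  # the two incomplete entries are removed

print(len(filled_mean), len(filled_median), len(dropped))  # 10 10 8
```

Filling keeps all ten rows at the cost of inventing two values, while dropping keeps only observed values but loses two transactions; which trade-off is acceptable depends on the downstream model.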
