# Assignment 1:
### The automated_stat_analyzer Function
Scenario: A retail company needs a utility to quickly summarize sales data. Students must create a function that identifies the "Central Tendency" and "Dispersion" of any numerical column.
### Requirements:

* Accept a Pandas DataFrame and a column name.

* Calculate the Mean, Median, and Standard Deviation.

* Identify whether the data is "Skewed" by comparing the Mean and Median: a Mean well above the Median suggests right skew, and a Mean well below it suggests left skew.

* Bonus: If the column is categorical, return the Mode instead.

### Your Data

In [1]:
import pandas as pd
import numpy as np

# Create a synthetic Company Sales Dataset
data = {
    'Transaction_ID': range(1, 11),
    'Product_Category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home', 
                         'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'],
    'Sales_Amount': [150, 200, 155, 300, 210, 180, 205, 1000, 190, 160], # 1000 is an Outlier
    'Customer_Age': [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],    # Contains Nulls (NaN)
    'Rating': [5, 4, 3, 5, 2, 4, 5, 2, 4, 3]
}

df_test = pd.DataFrame(data)

# Save to CSV for students to practice loading files
df_test.to_csv('company_sales_test.csv', index=False)
print("Test dataset created successfully!")

Test dataset created successfully!


In [2]:
df_test.head()

   Transaction_ID Product_Category  Sales_Amount  Customer_Age  Rating
0               1      Electronics           150          25.0       5
1               2             Home           200          34.0       4
2               3      Electronics           155           NaN       3
3               4           Sports           300          45.0       5
4               5             Home           210          23.0       2


In [12]:
import pandas as pd

def automated_stat_analyzer(df_test, column_name):
    """
    Company Task: Provide a summary report of a specific data variable.
    
    Instructions:
    1. Check if the column is numerical or categorical.
    2. For numerical: Calculate Mean, Median, and Standard Deviation.
    3. For categorical: Calculate the Mode.
    4. Return a dictionary with these statistical measures.
    """
    # Get the column data
    column_data = df_test[column_name]
    
    # Check if the column is numerical
    if pd.api.types.is_numeric_dtype(column_data):
        # For numerical columns
        result = {
            "Mean": column_data.mean(),
            "Median": column_data.median(),
            "Standard Deviation": column_data.std()
        }
    else:
        # For categorical columns
        # mode() returns a Series (possibly with ties), so we take the first
        # value by position; it is empty only if the column has no values
        mode_value = column_data.mode()
        result = {
            "Mode": mode_value.iloc[0] if not mode_value.empty else None
        }
    
    return result
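
The requirements also ask for a skew check, which the analyzer above does not report. One way to add it is sketched below. The helper name `skew_direction` and the `tolerance` cutoff (a fraction of the standard deviation) are illustrative assumptions, not part of the assignment; treat this as a rough heuristic rather than a formal skewness test.

```python
import pandas as pd

def skew_direction(series, tolerance=0.1):
    """Hypothetical helper: classify skew by comparing mean and median.

    `tolerance` is an assumed cutoff, expressed as a fraction of the
    standard deviation; the assignment does not specify one.
    """
    values = series.dropna()
    std = values.std()
    # With fewer than two values, or no spread, there is nothing to compare
    if pd.isna(std) or std == 0:
        return "Approximately symmetric"
    gap = values.mean() - values.median()
    if abs(gap) <= tolerance * std:
        return "Approximately symmetric"
    return "Right-skewed" if gap > 0 else "Left-skewed"

# Same values as the Sales_Amount column (mean 275, median 195)
sales = pd.Series([150, 200, 155, 300, 210, 180, 205, 1000, 190, 160])
print(skew_direction(sales))
```

For `Sales_Amount` this reports `Right-skewed`, because the single outlier of 1000 pulls the mean far above the median.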

In [13]:
automated_stat_analyzer(df_test, 'Sales_Amount')

{'Mean': np.float64(275.0),
 'Median': np.float64(195.0),
 'Standard Deviation': np.float64(258.30645021412494)}

# Assignment 2:
### The null_handling_strategy Function
Scenario: Incoming user data often has missing values. Students must implement a flexible strategy to handle these "Null Values" to prepare data for Machine Learning.
### Requirements:

* Check for null values in the DataFrame.

* Apply a strategy based on parameters: "drop_rows", "fill_mean", or "fill_median".

* Ensure the function only fills numerical columns when using mean or median.

In [15]:
def null_handling_strategy(df, strategy="fill_mean"):
    """
    Company Task: Clean a dataset by resolving missing (NaN) values.
    """
    # Make a copy of the dataframe to avoid modifying the original
    df_clean = df.copy()
    
    # Check for null values
    null_count = df_clean.isnull().sum().sum()
    print(f"Total null values found: {null_count}")
    
    if strategy == "drop_rows":
        # Drop all rows with any null values
        df_clean = df_clean.dropna()
        
    elif strategy == "fill_mean":
        # Fill numerical columns with mean
        numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if df_clean[col].isnull().any():
                df_clean[col] = df_clean[col].fillna(df_clean[col].mean())
                
    elif strategy == "fill_median":
        # Fill numerical columns with median
        numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if df_clean[col].isnull().any():
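                df_clean[col] = df_clean[col].fillna(df_clean[col].median())

    else:
        # Reject unknown strategies instead of silently returning unchanged data
        raise ValueError(f"Unknown strategy: {strategy!r}")

    # Return the cleaned dataframe
    return df_clean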

In [17]:
# Create a sample dataframe with missing values
data = {
    'Customer_Age': [25, None, 30, None, 35, 28, 40],
    'Sales_Amount': [100, 200, None, 300, 250, None, 350],
    'Category': ['A', 'B', 'A', None, 'B', 'A', 'A']
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\nMissing values per column:")
print(df.isnull().sum())


Original DataFrame:
   Customer_Age  Sales_Amount Category
0          25.0         100.0        A
1           NaN         200.0        B
2          30.0           NaN        A
3           NaN         300.0     None
4          35.0         250.0        B
5          28.0           NaN        A
6          40.0         350.0        A

Missing values per column:
Customer_Age    2
Sales_Amount    2
Category        1
dtype: int64
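
To see what each strategy does to this sample, the same three options can be expressed with built-in pandas calls. This is a compact sketch for comparison, not the assignment's function; the variable names `dropped`, `mean_filled`, and `median_filled` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_Age': [25, None, 30, None, 35, 28, 40],
    'Sales_Amount': [100, 200, None, 300, 250, None, 350],
    'Category': ['A', 'B', 'A', None, 'B', 'A', 'A'],
})

# "drop_rows": keep only rows with no missing values
dropped = df.dropna()

# "fill_mean" / "fill_median": numeric_only=True restricts the statistics to
# numerical columns, so the missing Category value is left untouched
mean_filled = df.fillna(df.mean(numeric_only=True))
median_filled = df.fillna(df.median(numeric_only=True))

print(len(dropped))
print(mean_filled.isnull().sum())
print(median_filled.isnull().sum())
```

On this sample, `dropped` keeps 3 of the 7 rows. Both fill variants keep all 7 rows and leave the one missing `Category` value in place, since a mean or median is not defined for a categorical column.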
