# Assignment 1:
### The automated_stat_analyzer Function
Scenario: A retail company needs a utility to quickly summarize sales data. Students must create a function that identifies the "Central Tendency" and "Dispersion" of any numerical column.
### Requirements:

* Accept a Pandas DataFrame and a column name.

* Calculate the Mean, Median, and Standard Deviation.

* Identify if the data is "Skewed" by comparing the Mean and Median.


* Bonus: If the column is categorical, return the Mode instead.
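
The skew requirement above relies on a rule of thumb: a long tail pulls the mean toward it, so a mean above the median suggests right skew. The sketch below applies that rule to the Sales_Amount values from the test data and cross-checks the direction with pandas' `Series.skew()`. The variable names are illustrative, and the rule is a heuristic rather than a formal test.

```python
import pandas as pd

# Sales figures copied from the assignment's test dataset (1000 is the outlier)
sales = pd.Series([150, 200, 155, 300, 210, 180, 205, 1000, 190, 160])

mean_val = sales.mean()
median_val = sales.median()

# Heuristic from the requirements: the mean is pulled toward the long tail
if mean_val > median_val:
    direction = "Right Skewed"
elif mean_val < median_val:
    direction = "Left Skewed"
else:
    direction = "Symmetric"

# Cross-check with the sample skewness coefficient (positive means right skew)
skew_coeff = sales.skew()

print(direction, round(skew_coeff, 2))
```

On real data the mean and median are almost never exactly equal, so a tolerance around zero may be more useful than a strict equality check for the "Symmetric" case.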

### Your Data

In [30]:
import pandas as pd
import numpy as np

# Create a synthetic Company Sales Dataset
data = {
    'Transaction_ID': range(1, 11),
    'Product_Category': ['Electronics', 'Home', 'Electronics', 'Sports', 'Home', 
                        'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'],
    'Sales_Amount': [150, 200, 155, 300, 210, 180, 205, 1000, 190, 160], # 1000 is an Outlier
    'Customer_Age': [25, 34, np.nan, 45, 23, 31, 29, np.nan, 38, 40],    # Contains Nulls (NaN)
    'Rating': [5, 4, 3, 5, 2, 4, 5, 2, 4, 3]
}

df_test = pd.DataFrame(data)

# Save to CSV for students to practice loading files
df_test.to_csv('company_sales_test.csv', index=False)
print("Test dataset created successfully!")

Test dataset created successfully!


In [31]:
df_test.head()

   Transaction_ID  Product_Category  Sales_Amount  Customer_Age  Rating
0               1       Electronics           150          25.0       5
1               2              Home           200          34.0       4
2               3       Electronics           155           NaN       3
3               4            Sports           300          45.0       5
4               5              Home           210          23.0       2


In [32]:
import pandas as pd

def automated_stat_analyzer(df, column_name):
    """
    Company Task: Provide a summary report of a specific data variable.
    
    Instructions:
    1. Check if the column is numerical or categorical.
    2. For numerical: Calculate Mean, Median, and Standard Deviation.
    3. For categorical: Calculate the Mode.
    4. Return a dictionary with these statistical measures.
    """
    # TODO: Implement using df[column_name].mean(), .median(), .std(), or .mode(). You can use Sales_Amount for your test case.
    column = df[column_name]    
    column_clean = column.dropna()
    
    result = {
        "Column": column_name,
        "Type": None
    }
    
    if pd.api.types.is_numeric_dtype(column_clean):
        mean_val = column_clean.mean()
        median_val = column_clean.median()
        std_val = column_clean.std()
        
        if mean_val > median_val:
            skewness = "Right Skewed"
        elif mean_val < median_val:
            skewness = "Left Skewed"
        else:
            skewness = "Symmetric"
        
        result.update({
            "Type": "Numerical",
            "Mean": float(round(mean_val, 2)),
            "Median": float(median_val),
            "Standard Deviation": float(round(std_val, 2)),
            "Skewness": skewness
        })  
        
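    else:
        # mode() returns a Series (ties are possible); take the first value,
        # or None when the column has no non-null entries
        modes = column_clean.mode()
        mode_val = modes.iloc[0] if not modes.empty else None
        result.update({
            "Type": "Categorical",
            "Mode": mode_val
        })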
    
    return result

automated_stat_analyzer(df_test, "Sales_Amount")






{'Column': 'Sales_Amount',
 'Type': 'Numerical',
 'Mean': 275.0,
 'Median': 195.0,
 'Standard Deviation': 258.31,
 'Skewness': 'Right Skewed'}
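
The bonus (categorical) branch is not exercised by the call above. The following minimal sketch shows the two pandas calls it depends on, using the Product_Category values from the test data. The variable names are illustrative and not part of the assignment.

```python
import pandas as pd

# Product categories copied from the assignment's test dataset
categories = pd.Series(['Electronics', 'Home', 'Electronics', 'Sports', 'Home',
                        'Electronics', 'Home', 'Sports', 'Electronics', 'Electronics'])

# String columns are not numeric, so the analyzer would take the categorical branch
is_numeric = pd.api.types.is_numeric_dtype(categories)

# mode() returns a Series because several values can tie for most frequent
modes = categories.mode()
top_category = modes.iloc[0] if not modes.empty else None

# With a tie, every tied value is returned in sorted order
tied_modes = pd.Series(['b', 'a']).mode().tolist()

print(is_numeric, top_category, tied_modes)
```

Taking only the first entry of `mode()` hides ties, so returning the full list may be preferable when ties matter.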

# Assignment 2:
### The null_handling_strategy Function

Scenario: Incoming user data often has missing values. Students must implement a flexible strategy to handle these "Null Values" to prepare data for Machine Learning.
### Requirements:

* Check for null values in the DataFrame.

* Apply a strategy based on parameters: "drop_rows", "fill_mean", or "fill_median".

* Ensure the function only fills numerical columns when using mean or median.
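
The requirements above can be sketched on a small frame. The DataFrame below is hypothetical (it is not the assignment data) and includes a missing value in a text column to show that a numeric-only fill leaves such nulls in place.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in both a text column and a numeric column
df = pd.DataFrame({
    'Product_Category': ['Electronics', 'Home', None, 'Sports'],
    'Customer_Age': [25, np.nan, 31, np.nan],
})

# Requirement 1: count missing values per column
nulls_per_column = df.isnull().sum()

# Requirement 3: restrict the fill to numeric columns only
numeric_cols = df.select_dtypes(include=np.number).columns
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(nulls_per_column)
print(filled)
```

Because only numeric columns are filled, the total null count after "fill_mean" or "fill_median" can stay above zero whenever a categorical column has gaps.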

In [33]:
def null_handling_strategy(df, strategy="fill_mean"):
    """
    Company Task: Clean a dataset by resolving missing (NaN) values.
    """
    # TODO: Implement using .isnull(), .dropna(), or .fillna(). You can use Customer_Age for your test case.
    
    null_count_before = df.isnull().sum().sum()
    
    if strategy == "drop_rows":
        cleaned_df = df.dropna()
        
    elif strategy == "fill_mean":
        cleaned_df = df.copy()
        numeric_cols = cleaned_df.select_dtypes(include=np.number).columns
        cleaned_df[numeric_cols] = cleaned_df[numeric_cols].fillna(
            cleaned_df[numeric_cols].mean()
        )
    
    elif strategy == "fill_median":
        cleaned_df = df.copy()
        numeric_cols = cleaned_df.select_dtypes(include=np.number).columns
        cleaned_df[numeric_cols] = cleaned_df[numeric_cols].fillna(
            cleaned_df[numeric_cols].median()
        )
    
    else:
        raise ValueError("Invalid strategy. Choose: 'drop_rows', 'fill_mean', or 'fill_median'")
    
    null_count_after = cleaned_df.isnull().sum().sum()
    
    print(f"Missing values before: {null_count_before}")
    print(f"Missing values after: {null_count_after}")
    
    return cleaned_df


In [34]:
automated_stat_analyzer(df_test, "Customer_Age")

{'Column': 'Customer_Age',
 'Type': 'Numerical',
 'Mean': 33.12,
 'Median': 32.5,
 'Standard Deviation': 7.59,
 'Skewness': 'Right Skewed'}

In [35]:
df_test_cleaned = null_handling_strategy(df_test, strategy="fill_mean")
df_test_cleaned["Customer_Age"]

Missing values before: 2
Missing values after: 0


0    25.000
1    34.000
2    33.125
3    45.000
4    23.000
5    31.000
6    29.000
7    33.125
8    38.000
9    40.000
Name: Customer_Age, dtype: float64

In [36]:
df_test_cleaned = null_handling_strategy(df_test, strategy="fill_median")
df_test_cleaned["Customer_Age"]

Missing values before: 2
Missing values after: 0


0    25.0
1    34.0
2    32.5
3    45.0
4    23.0
5    31.0
6    29.0
7    32.5
8    38.0
9    40.0
Name: Customer_Age, dtype: float64

In [37]:
df_test_cleaned = null_handling_strategy(df_test, strategy="drop_rows")
df_test_cleaned["Customer_Age"]

Missing values before: 2
Missing values after: 0


0    25.0
1    34.0
3    45.0
4    23.0
5    31.0
6    29.0
8    38.0
9    40.0
Name: Customer_Age, dtype: float64
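
The output above keeps the original row labels, so labels 2 and 7 are missing after the drop. If later steps expect a contiguous index, one option is `reset_index(drop=True)`, sketched here on a small hypothetical frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age
ages = pd.DataFrame({'Customer_Age': [25, 34, np.nan, 45]})

# dropna() removes the row but keeps the surviving labels (0, 1, 3)
dropped = ages.dropna()

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels
renumbered = dropped.reset_index(drop=True)

print(list(dropped.index), list(renumbered.index))
```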

In [38]:
from analyzer import automated_stat_analyzer 
automated_stat_analyzer(df_test, "Sales_Amount")

{'Column': 'Sales_Amount',
 'Type': 'Numerical',
 'Mean': 275.0,
 'Median': 195.0,
 'Standard Deviation': 258.31,
 'Skewness': 'Right Skewed'}

In [39]:
from null_handle import null_handling_strategy
df_test_cleaned = null_handling_strategy(df_test, strategy="drop_rows")
df_test_cleaned["Customer_Age"]


ImportError: cannot import name 'pd' from 'pandas' (C:\Users\arbs\AppData\Roaming\Python\Python312\site-packages\pandas\__init__.py)
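
The ImportError above means some module tried to import a name `pd` from the pandas package. pandas exposes no such name: `pd` is only the conventional alias created by `import pandas as pd`. The contents of `null_handle.py` are not shown, so the cause is an assumption here, but a line such as `from pandas import pd` in that file would produce exactly this message. A module also cannot see the notebook's imports, so it has to import `numpy` itself. A minimal sketch of what the module could look like:

```python
# null_handle.py (hypothetical contents; the real file is not shown in the notebook)
import numpy as np
import pandas as pd  # correct form; "from pandas import pd" raises ImportError


def null_handling_strategy(df, strategy="fill_mean"):
    """Return a cleaned copy of df; the input frame is left unchanged."""
    if strategy == "drop_rows":
        return df.dropna()
    if strategy not in ("fill_mean", "fill_median"):
        raise ValueError(
            "Invalid strategy. Choose: 'drop_rows', 'fill_mean', or 'fill_median'"
        )

    cleaned_df = df.copy()
    # Only numeric columns have a meaningful mean or median
    numeric_cols = cleaned_df.select_dtypes(include=np.number).columns
    if strategy == "fill_mean":
        fill_values = cleaned_df[numeric_cols].mean()
    else:
        fill_values = cleaned_df[numeric_cols].median()
    cleaned_df[numeric_cols] = cleaned_df[numeric_cols].fillna(fill_values)
    return cleaned_df
```

After editing the file, restart the kernel or call `importlib.reload` on the module, since Jupyter keeps the previously imported version in memory.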