# Project 1

## General Project Information

Dataset: Employee_Complete_Dataset.csv  
Source: Kaggle – Cassandra Systems Employee Dataset (https://www.kaggle.com/datasets/rockyt07/cassandra-employee-dataset)  
Description: This dataset contains information on 50,000 employees at Cassandra Systems Inc. It includes demographic and professional data that can be used for exploratory data analysis and other analytical purposes.  
Rows: 50,000  
Chosen numeric column: Employee_age  

Objective:  
 The goal is to compute the mean, median, and mode of a numeric column first with pandas (Step 5) and then manually with core Python (Step 6). Finally, Step 7 creates an ASCII-based visualization using only the Python standard library.

Key deliverables:  
- Demonstrate the ability to process and analyze data with and without pandas  
- Produce consistent, verified results between both methods  
- Present clear explanations and readable output for each step

This notebook follows a concise, step-by-step style so that each section can be understood independently.


In [None]:
# Configuration
DATA_PATH = "Employee_Complete_Dataset.csv"
NUMERIC_COLUMN = "Employee_age"
GROUP_COL = None
MAX_ROWS = None

## Step 5 — Using pandas


Read the dataset, select the column Employee_age, and calculate the mean, median, and mode.

This step uses pandas for efficient data handling and statistical analysis. The results serve as a reference for later manual calculations.

Steps:  
1. Read the CSV file using pd.read_csv.  
2. Convert the target column to numeric using to_numeric(errors="coerce") to handle invalid values.  
3. Use built-in pandas methods mean(), median(), and mode() to compute the results.

The output shows the number of rows loaded, the number of valid numeric values, and the three computed statistics.
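
The effect of `errors="coerce"` and the way `mode()` handles ties are easier to see on a tiny example. The sketch below uses a handful of made-up values (an assumption for illustration, not rows from the dataset) and only calls that pandas is documented to provide.

```python
import pandas as pd

# Hypothetical raw values, as they might look when a CSV column is read as text.
raw = pd.Series(["36", "34", "n/a", "42", "", "36", "34"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")
clean = numeric.dropna()

print("Invalid entries:", int(numeric.isna().sum()))
print("Valid entries:", len(clean))
print("Mean:", clean.mean())
print("Median:", clean.median())

# Series.mode() returns every tied value in ascending order,
# so taking .iloc[0] picks the smallest of the tied modes.
print("Modes:", clean.mode().tolist())
```

Here 34 and 36 both appear twice, so `mode()` returns both, and `.iloc[0]` would report 34. Any manual implementation has to break ties the same way to agree with pandas.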


In [None]:
import pandas as pd

def step5_with_pandas(path, value_col, max_rows=None):
    read_kwargs = {}
    if max_rows is not None:
        read_kwargs["nrows"] = int(max_rows)
    df = pd.read_csv(path, **read_kwargs)
    if value_col not in df.columns:
        raise ValueError(f"Column '{value_col}' not found in the CSV header.")
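    # Coerce unparseable entries to NaN, then drop them so they do not affect the statistics.
    s = pd.to_numeric(df[value_col], errors="coerce").dropna()
    mean_pd = s.mean()
    median_pd = s.median()
    # mode() returns all tied values in ascending order; keep the smallest.
    modes = s.mode()
    mode_pd_value = modes.iloc[0] if len(modes) > 0 else None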
    result = {
        "df": df,
        "series": s,
        "mean": float(mean_pd),
        "median": float(median_pd),
        "mode": float(mode_pd_value) if mode_pd_value is not None else None,
    }
    return result

print("Step 5: pandas")
p5 = step5_with_pandas(DATA_PATH, NUMERIC_COLUMN, MAX_ROWS)
print("Rows loaded:", len(p5["df"]))
print("Numeric values used:", len(p5["series"]))
print("Mean:", p5["mean"])
print("Median:", p5["median"])
print("Mode:", p5["mode"])


Step 5: pandas
Rows loaded: 50000
Numeric values used: 50000
Mean: 34.7048
Median: 35.0
Mode: 18.0


## Step 6 — The Hard Way


Recalculate the same statistics without using pandas or any external libraries.

This confirms understanding of how basic statistical operations are performed algorithmically and validates the accuracy of pandas results.

Steps:  
1. Use csv.reader to read the file manually.  
2. Extract the numeric column and ignore invalid or missing values.  
3. Implement simple Python functions to compute mean, median, and mode.  
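
The first two steps can be tried without the real file. This is a minimal sketch that reads from an in-memory string instead of disk; the CSV content and the `Employee_id` column are hypothetical, and `csv.DictReader` is used here for brevity rather than the `csv.reader` approach of the notebook cell.

```python
import csv
import io
from math import isfinite

# Hypothetical CSV content standing in for the real file.
sample_csv = "Employee_id,Employee_age\n1,36\n2,\n3,abc\n4,42\n5,nan\n"

def read_numeric_column(text_stream, column):
    """Collect the finite numeric values of one column, skipping everything else."""
    values = []
    for row in csv.DictReader(text_stream):
        try:
            v = float(row[column])
        except (TypeError, ValueError):
            continue  # missing or non-numeric cell
        if isfinite(v):  # float() accepts "nan" and "inf"; drop those as well
            values.append(v)
    return values

ages = read_numeric_column(io.StringIO(sample_csv), "Employee_age")
print(ages)  # [36.0, 42.0]
```

Only two of the five rows survive: the empty cell and `abc` fail to parse, and `nan` parses but is not finite.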

The results from this section match those from Step 5 within small rounding differences.
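
Besides comparing against pandas, hand-written functions can be checked against Python's built-in `statistics` module. The sketch below assumes a small made-up sample and uses simplified stand-ins (`median_sketch`, `mode_sketch`, both hypothetical names) rather than the notebook's own functions.

```python
import math
import statistics

def median_sketch(nums):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    arr = sorted(nums)
    n = len(arr)
    mid = n // 2
    return arr[mid] if n % 2 else (arr[mid - 1] + arr[mid]) / 2

def mode_sketch(nums):
    """Most frequent value; ties are resolved by taking the smallest value."""
    counts = {}
    for x in nums:
        counts[x] = counts.get(x, 0) + 1
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

sample = [36.0, 34.0, 42.0, 27.0, 36.0, 27.0]  # hypothetical ages

mean_ok = math.isclose(sum(sample) / len(sample), statistics.mean(sample))
median_ok = median_sketch(sample) == statistics.median(sample)
# multimode() lists every tied mode; the smallest one is the pandas-style answer.
mode_ok = mode_sketch(sample) == min(statistics.multimode(sample))

print("mean ok:", mean_ok, "| median ok:", median_ok, "| mode ok:", mode_ok)
```

The sample has an even number of values and two tied modes (27 and 36), which are the two cases most likely to expose a mistake in a manual implementation.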


In [None]:
import csv
from math import isfinite

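def iter_numeric_from_csv_stdlib(path, numeric_col, max_rows=None):
    """Return the finite numeric values of one CSV column as a list of floats."""
    values = []
    # newline="" lets the csv module handle line endings itself.
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = [h.strip() for h in next(reader)]
        if numeric_col not in header:
            raise ValueError(f"Column '{numeric_col}' not found in the CSV header.")
        idx = header.index(numeric_col)
        for i, row in enumerate(reader):
            if max_rows is not None and i >= max_rows:
                break
            try:
                v = float(row[idx])
            except (ValueError, IndexError):
                continue  # skip missing, non-numeric, or short rows
            if isfinite(v):  # float() accepts "nan" and "inf"; drop those too
                values.append(v)
    return values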

def mean_py(nums):
    return sum(nums) / len(nums) if nums else None

def median_py(nums):
    if not nums:
        return None
    arr = sorted(nums)
    n = len(arr)
    mid = n // 2
    return arr[mid] if n % 2 else (arr[mid-1] + arr[mid]) / 2

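def mode_py(nums):
    if not nums:
        return None
    counts = {}
    for x in nums:
        counts[x] = counts.get(x, 0) + 1
    top = max(counts.values())
    # On ties, return the smallest value, as pandas' mode().iloc[0] does.
    return min(x for x, c in counts.items() if c == top)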

print("Step 6: Hard Way")
values = iter_numeric_from_csv_stdlib(DATA_PATH, NUMERIC_COLUMN, MAX_ROWS)
mean_p = mean_py(values)
median_p = median_py(values)
mode_p = mode_py(values)
print("Values used:", len(values))
print("Mean:", mean_p)
print("Median:", median_p)
print("Mode:", mode_p)


Step 6: Hard Way
Values used: 50000
Mean: 34.7048
Median: 35.0
Mode: 18.0


## Step 7 — Visualization

Create a simple ASCII visualization of the data using only Python’s built-in features.

This satisfies the requirement to visualize results without using any external plotting libraries and shows a text-based way to represent numerical data.
  
Scale each value linearly between the smallest and largest plotted values to a bar length between 1 and 40 characters, and print the first 25 rows as a simple bar chart.

Each line shows the row label, the bar, and the corresponding numeric value.
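
The scaling rule can be isolated in a few lines. This sketch uses a hypothetical helper, `bar_length`, and takes the minimum (21) and maximum (60) from the first 25 rows shown in the output.

```python
def bar_length(value, vmin, vmax, max_symbols=40):
    """Linearly map value from [vmin, vmax] to a bar length in [1, max_symbols]."""
    span = (vmax - vmin) or 1.0  # avoid dividing by zero when all values are equal
    return max(1, round((value - vmin) / span * max_symbols))

# Smallest, a middle, and largest age among the first 25 rows.
for age in (21, 36, 60):
    print(age, bar_length(age, vmin=21, vmax=60))
```

This prints lengths of 1, 15, and 40. Because the scale starts at the smallest value rather than at zero, a bar shows where a value sits within the plotted range; a bar twice as long does not mean an age twice as large.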


In [None]:
def ascii_bar_chart(labels, numbers, title=None, max_symbols=40, symbol="*"):
    pairs = []
    for label, n in zip(labels, numbers):
        try:
            n = float(n)
        except (TypeError, ValueError):
            continue  # skip entries whose value is not numeric
        if isfinite(n):  # isfinite() also rejects NaN
            pairs.append((str(label), n))
    if not pairs:
        print("No data to plot.")
        return
    vals = [n for _, n in pairs]
    vmin, vmax = min(vals), max(vals)
    span = (vmax - vmin) or 1.0
    if title:
        print(title)
        print("-" * len(title))
    for lab, val in pairs:
        k = max(1, int(round((val - vmin) / span * max_symbols)))
        bar = symbol * k
        print(f"{lab:>8}: {bar} ({val:.2f})")

print("Step 7: Visualization")
N = 25
labels_A = [f"row{i+1}" for i in range(min(N, len(values)))]
numbers_A = values[:N]
ascii_bar_chart(labels_A, numbers_A, title=f"{NUMERIC_COLUMN} (first {len(numbers_A)} rows)")


Step 7: Visualization
Employee_age (first 25 rows)
----------------------------
    row1: *************** (36.00)
    row2: ************* (34.00)
    row3: *************** (36.00)
    row4: ********************** (42.00)
    row5: ****** (27.00)
    row6: ***************** (38.00)
    row7: ******** (29.00)
    row8: **************** (37.00)
    row9: ********************** (42.00)
   row10: ***** (26.00)
   row11: *************** (36.00)
   row12: *** (24.00)
   row13: *** (24.00)
   row14: ***** (26.00)
   row15: ******************* (40.00)
   row16: ********************************** (54.00)
   row17: * (21.00)
   row18: ************ (33.00)
   row19: *************************** (47.00)
   row20: *************** (36.00)
   row21: ************************* (45.00)
   row22: **************************************** (60.00)
   row23: ****** (27.00)
   row24: *** (24.00)
   row25: *********** (32.00)


## Conclusion

Both methods produced the same results for mean, median, and mode, which shows that the manual implementation is accurate.  
The ASCII chart covers only the first 25 rows, so it gives a rough impression of the data rather than the full age distribution: in this sample the ages range from 21 to 60, and 18 of the 25 values are in the twenties or thirties.  
Overall, this project helped me understand what pandas does behind the scenes and how the same calculations can be done with just core Python.  
It was also interesting to see how a simple text-based plot can still give a clear sense of the data pattern.
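
A per-row bar chart shows individual values, while a binned histogram shows how often each range occurs. As a possible extension, the sketch below groups values by decade using only the standard library. The function name `ascii_histogram` and the bin width of 10 are assumptions, and the input is the 25 ages printed in the Step 7 output.

```python
from collections import Counter

def ascii_histogram(values, bin_width=10, symbol="*"):
    """Return one text line per bin, with one symbol per value in that bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:>7}: {symbol * counts[start]} ({counts[start]})")
    return lines

# The 25 ages printed in the Step 7 output.
first_25 = [36, 34, 36, 42, 27, 38, 29, 37, 42, 26, 36, 24, 24,
            26, 40, 54, 21, 33, 47, 36, 45, 60, 27, 24, 32]

for line in ascii_histogram(first_25):
    print(line)
```

For these 25 values the bins hold 9, 9, 5, 1, and 1 values for the twenties through the sixties. With all 50,000 rows, one symbol per value would be far too wide, so the counts would need to be scaled down to a maximum bar length first.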