# Assignment #1 - Basic Data Exploration, Visualization, and Analysis

## Assignment Overview

In this assignment you'll load some data into a Python notebook and use a few basic functions to analyze it. Each section asks you to either calculate some answers or explore some information found in the data. When generating your answers, keep in mind a few factors that can make your code better:
<ul>
<li> Present the answers clearly. Use markdown cells, code comments, and formatting to make your answers readable. One of the features of notebooks is that they let us combine code and commentary, so both need to be readable. Refer to the guide in the guides folder of the exercises workbook for an explanation and examples of different formatting. </li>
<li> Make your code clear. Small pieces of code are easy to make sense of for a short period of time, so unclear code won't really hurt your ability to find the answers here. If you need to come back to it later, or others need to edit it, code that doesn't make sense becomes a big issue. Use clearly named variables, comments, and spacing to keep things readable. Even in this course, if you look back at something from two months ago for the project, code that has been cleaned up a little is far easier to understand. </li>
<li> Structure the code well. If there is some kind of repetitive task, it should likely be moved into a function. If something happens several times, it should be in a loop. Well-structured code makes it easy to reuse things later, understand how they work, debug errors, and share code with others. This is something to keep in the back of your mind. Right now you may not have much experience to lean on when judging how things should be; as you read, adjust, and write code, it will become clearer. </li>
</ul>

## Grading

This assignment will be graded in two portions:
<ul>
<li> 50% - Correctness and functionality. Parts of the assignment (the functions you are asked to write) will be graded on whether they work correctly and generate correct answers. </li>
<li> 50% - Analysis and presentation. Parts of the assignment (the markdown cells you are asked to fill in) will be graded on whether they present the answers clearly, and whether the analysis is correct. </li>
</ul>

## Load Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import asn1_function_sheet as afs

try:
    df = pd.read_csv("LabourTrainingEvaluationData.csv")
except FileNotFoundError:
    df = pd.read_csv("../data/LabourTrainingEvaluationData.csv")
# astype returns a new Series rather than converting in place, so assign it back
df["Nodeg"] = df["Nodeg"].astype("category")
df.head()

In [None]:
df.describe()

### Part 1

<ol>
<li> Create a function called age_splitter that takes a dataframe, a column name, and an age threshold as input. The function should return two dataframes, one with all the rows where the age in the specified column is below the threshold, and one with all the rows where the age in the specified column is above or equal to the threshold. </li>
<li> Use this function to calculate the percentage of people in the dataset that are below 30 years old. </li>
<li> Use this function to compare the 1978 earnings of the two groups to see which is larger. Show this arithmetically as well as visually. </li>
</ol>
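The split described in step 1 can be sketched as below. This is a minimal illustration, not the contents of `asn1_function_sheet`: the parameter names and the small demo frame are assumptions made for the example.

```python
import pandas as pd


def age_splitter(df: pd.DataFrame, col_name: str, age_threshold: float):
    """Split df into (rows below the threshold, rows at or above it)."""
    below_mask = df[col_name] < age_threshold
    # A missing age satisfies neither comparison, so the second group is
    # selected with an explicit >= instead of negating the first mask.
    at_or_above_mask = df[col_name] >= age_threshold
    return df[below_mask], df[at_or_above_mask]


# Made-up rows, only to show the call shape (not the assignment data)
demo = pd.DataFrame({
    "Age": [22, 30, 41, 29],
    "Earnings_1978": [100.0, 200.0, 300.0, 400.0],
})
younger, older = age_splitter(demo, "Age", 30)
print(len(younger), len(older))
```

Returning the two frames as a tuple lets the caller unpack them in one line, which is the pattern the demo cell below relies on.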

In [None]:
# 1 - Demo of function
under30, over30 = afs.age_splitter(df, "Age", 30) # Split with 30 as the threshold
#print(under30)
#print(over30)

# 2 - Percentage Under 30
pct_under30 = len(under30) / len(df) * 100
print(f"Percentage of people under 30: {pct_under30:.2f}%")

# 3 - 1978 Earnings Comparison
mean_under30 = under30["Earnings_1978"].mean()
mean_over30 = over30["Earnings_1978"].mean()

print(f"Mean earnings (1978) under 30: {mean_under30:.2f}")
print(f"Mean earnings (1978) 30 and over: {mean_over30:.2f}")

# Visual Representation
import matplotlib.pyplot as plt

plt.bar(["Under 30", "30 and over"], [mean_under30, mean_over30])
plt.ylabel("Mean 1978 Earnings")
plt.title("Comparison of 1978 Earnings by Age Group")
plt.show()


### Part 2

<ol>
<li>Create a function in the .py file called cohortCompare that takes two arguments: a dataframe and a list of categorical column names. The function should return a dictionary of the key statistics for each numerical column and counts for each categorical column.</li>
    <ul>
    <li> Mean, Median, Standard Deviation, Min, Max for numerical columns </li>
    <li> Counts for categorical columns </li>
    <li><b>Note:</b> Please use the CohortMetric object to store and manage the statistics for each cohort.</li>
    </ul>
<li> Does this data, at a high level, appear to be representative of the general population of the US in the late 70s? Does it now? Why or why not? </li>
    <ul>
    <li> This does not need to be a long answer or done in incredible depth. This question will generate some demographic profiles of people in the data. Do they appear to be similar to the US population at the time? </li>
    <li> Please state how you assessed this. (There isn't one correct answer; the process is more important than the answer.) </li>
    </ul>
<li>Print the dictionary returned in a nice-ish way. (Don't go crazy, basic formatting)</li>
</ol>
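One possible shape for the summary in step 1 is sketched below. The `CohortMetric` object is not shown in this notebook, so the sketch stores each column's statistics in a plain dictionary instead; the function name `cohort_compare_sketch` and the demo frame are hypothetical.

```python
import pandas as pd


def cohort_compare_sketch(df: pd.DataFrame, cat_cols: list) -> dict:
    """Return {column: stats} for numeric columns and {column: counts} for categorical ones."""
    summary = {}
    for col in df.columns:
        if col in cat_cols:
            # Categorical column: count how often each value occurs
            summary[col] = df[col].value_counts().to_dict()
        elif pd.api.types.is_numeric_dtype(df[col]):
            summary[col] = {
                "mean": df[col].mean(),
                "median": df[col].median(),
                "std": df[col].std(),  # pandas default: sample std (ddof=1)
                "min": df[col].min(),
                "max": df[col].max(),
            }
    return summary


# Made-up rows, only to show the output shape (not the assignment data)
demo = pd.DataFrame({"Age": [20, 30, 40], "Nodeg": [1, 0, 1]})
summary = cohort_compare_sketch(demo, ["Nodeg"])
for col, stats in summary.items():
    print(col, stats)
```

Checking `cat_cols` first matters when a categorical column is stored as 0/1 integers: it would otherwise pass the numeric check and be summarized with a mean.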

In [None]:
# 1 - Demo of function
# Split columns correctly
numeric_cohorts = ["Age", "Earnings_1978"]   # numbers only
categorical_cohorts = ["Eduacation", "Race", "Hisp", "MaritalStatus", "Nodeg"]

# Run cohortCompare on numeric cohorts
results = afs.cohortCompare(df, numeric_cohorts)

print("=== Numeric Cohort Statistics ===")
for cohort, metrics in results.items():
    print(metrics)

# Handle categorical separately
print("\n=== Categorical Cohort Counts ===")
for col in categorical_cohorts:
    print(f"\nCohort: {col}")
    print(df[col].value_counts())

Answer to Question 2: 
The dataset is a roughly accurate representation of the US population in the late 1970s. Most individuals are working-age adults, and most have a high school education. Black individuals are underrepresented and married people seem to be overrepresented. Overall, the data gives a reasonable representation of the era, even if it is not perfectly proportional.

### Part 3

<ol>
<li> Create a function in the .py file called effectSizer that takes in a dataframe, a numerical column name, a column name of a categorical value that is binary (two values only), and returns a dictionary of the categorical classes and their corresponding effect sizes on the numerical value. </li>
<li> For 1978, which of Race, Hisp, and MaritalStatus has the largest effect size? (Use Yes/True/1 for x1.)</li>
</ol>
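The assignment text does not say which effect size measure to use. The sketch below assumes Cohen's d with a pooled sample standard deviation, which is a common choice for comparing two group means. The helper name `cohens_d`, the 1/0 coding, and the demo values are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def cohens_d(x1, x2) -> float:
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)


# Made-up rows with an assumed 1/0 coding (not the assignment data)
demo = pd.DataFrame({
    "Earnings_1978": [10.0, 12.0, 14.0, 6.0, 8.0, 10.0],
    "MaritalStatus": [1, 1, 1, 0, 0, 0],
})
x1 = demo.loc[demo["MaritalStatus"] == 1, "Earnings_1978"]  # the "Yes/True/1" group
x2 = demo.loc[demo["MaritalStatus"] == 0, "Earnings_1978"]
d = cohens_d(x1, x2)
print(f"d = {d:.2f}")
```

The sign of d depends on which group is treated as x1, so comparisons across columns are usually made on the magnitude of d.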

In [None]:
# 1. Demo of function

# Effect sizes comparison
cols_to_check = ["Race", "Hisp", "MaritalStatus"]
effect_sizes = {}

for col in cols_to_check:
    try:
        effect_sizes[col] = afs.effectSizer(df, "Earnings_1978", col)
    except ValueError as e:
        print(f"Skipping {col}: {e}")

# Sort by magnitude and print with the largest effect highlighted
# (an effect size can be negative, so compare absolute values)
max_effect = max(effect_sizes.values(), key=abs)
print("Effect sizes for Earnings_1978:\n")
for col, d in sorted(effect_sizes.items(), key=lambda x: abs(x[1]), reverse=True):
    marker = "<-- largest effect" if d == max_effect else ""
    print(f"{col:15}: d = {d:.2f} {marker}")