In [1]:
# Useful Imports
# Feel free to add or change
import thinkplot
import thinkstats2
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
from matplotlib import pyplot as plt

# Assignment 1

## Assignment Overview

In this assignment you'll load some data into a Python notebook and use a few basic functions to do some simple analysis.

### Preamble

Notebooks let us combine code and text in a smart and presentable way, and this assignment covers the basics of making things look nice and clear. Text in these markdown cells can be formatted with HTML-ish code. If you don't know any HTML, it is pretty easy to pick up; here are some samples that you can copy for formatting. Overall, the goal is to make it clear what your answers are and to communicate them to the reader (me). Please use things like headers, labels, and lists to make your findings readable. Double-click the text in a markdown cell to make it editable and see the markup examples.

#### Headers

Headers can be made by putting pound signs in front of the text, as above. One pound sign is H1 (the biggest header), and the size scales down as more pound signs are added, down to six (H6).

#### Lists

Lists can be created using HTML markup. Here is an example of a numbered and a non-numbered list; the non-numbered one has an additional level of indenting:

<ol>
<li> Some point
<li> The next point
<li> One more point
</ol>

<ul>
<li> Stuff 
<li> Lots of stuff with details
    <ul>
    <li> Sub-point
    <li> More sub-points
    </ul>
<li> More stuff
</ul>

#### Text Stuff

We can also do things like bold and italic, with new lines:

<b> I'm bold</b><br>
<i> I'm italic</i><br>
<b><i> I'm both </i></b><br>

#### More Info

There is a lot more that can be used to make things pretty. You don't need to be an artist or a web designer; just make it clear. There is a PDF (jupyter_markdown.pdf) in this repository and on Moodle that shows examples of far more formatting options.

Take a look at this notebook (https://www.kaggle.com/firefliesqn/tuning-deepsort-helmet-mapping) for a formatting/layout example: sections are labelled, there's an explanation, and it's easy to read. Ignore the contents; they cover a totally different topic.

The goal is to have something that someone can read through and make sense of. 

## Testing

One of the files in this repository is a test harness. This is what I will use to check the answers for the parts that are exact. The only part that is missing from what you have here is a file with the answers.

You can run this file to check yours as you work, you will need to make a few changes:
<ul>
<li> The answers in the test harness are currently set to be read from an answer file. You'll want to comment that out, and set those variables to raw values. 
<li> If you run it, it will check if the results of your code match whatever "answer" you provide. 
<li> This isn't super complex, but it does require a little thought. You don't need to do this for the assignment at all, but if you feel comfortable with the coding, this is good practice. 
<li> <b>Even if you don't actually use the testing, the test harness shows how your functions will be "called" (used), the code in the test harness executes the corresponding functions in your file. This should function, even if you aren't actually checking your results. </b>
</ul>


### Example Function Call

Below is a sample function that mirrors what you're being asked to do. 

<b>Create a function that returns the product of two numbers, plus 1</b>

<b>Parameters</b> - Number A and Number B. 

<b>Return</b> - Those two numbers multiplied, plus 1. 

<b>Starting function framework</b>

    def times_plus_one(numA, numB):
        return 0

This is our starting point: we know we need to take in the two numbers as input and generate our goal as output. The function <b>should not</b> pull in any other variables from the body of our code; it has all that it needs already. This is important, as someone could take this one function, paste it into other programs, and use it as is. That is effectively what we do with the libraries we import.
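To illustrate why this matters, here is a small sketch contrasting a function that depends on a variable defined elsewhere in the notebook with one that receives everything through its parameters. The names `tax_rate`, `price_with_tax_global`, and `price_with_tax` are made up for this illustration and are not part of the assignment.

```python
# Hypothetical example; none of these names come from the assignment.
tax_rate = 0.5  # a variable living in the body of the notebook

def price_with_tax_global(price):
    # Relies on tax_rate existing outside the function, so it raises a
    # NameError if pasted into a program that never defines tax_rate.
    return price * (1 + tax_rate)

def price_with_tax(price, rate):
    # Everything needed arrives through the parameters, so it is portable.
    return price * (1 + rate)

print(price_with_tax_global(100))
print(price_with_tax(100, 0.5))
```

Both calls give the same result here, but only the second function would keep working if it were copied into another file on its own.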

<b> Now we need to fill in the code...</b>

In [2]:
# Fill the body of the function to translate from parameters to the return value. 
def times_plus_one(numA, numB):
    temp_product = numA * numB
    return (temp_product + 1)

Now we have a function that, hopefully, works. We can test it by calling it and looking at the results. Note that this same function can be used for all kinds of different calculations.

In [3]:
print(times_plus_one(2,2))
print(times_plus_one(10,2))
print(times_plus_one(2.345,6.78))

5
21
16.899100000000004


We could even take the value and compare it to a known answer. This is how we test code to make sure it works without error: we write a "test harness" to run the code, capture the values, and check whether they are correct. In real life we could test a large range of values automatically. Usually the most important tests are those at the "edges", meaning inputs that are borderline or uncommon. For example, here we'd likely focus on decimals, negatives, 0s, and non-numerical inputs as things we know we need to test. If the function does fine with (2,2), it'll probably be OK with (5,14), but a 0 or -23.452 may be different enough to cause something unexpected.

This is how anything that I say will be automatically graded will be assessed: I write a piece of code that runs tests. It "grabs" your functions (this is why exact naming is critical here), runs some trials, and checks that the answers match mine.

In [4]:
# Simple
val_2_2 = times_plus_one(2,2)
print(val_2_2 == 5)

# Or a little more elaborate
val_31_72 = times_plus_one(3.1, 7.2)
sol_31_72 = (3.1 * 7.2) + 1
print(val_31_72 == sol_31_72)

True
True
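The same idea extends to the "edge" inputs mentioned earlier. The sketch below is only an illustration, not the actual grading harness: the list name `edge_cases` and the chosen values are assumptions. It uses `math.isclose` rather than `==` because decimal inputs can pick up tiny floating-point rounding differences, like the `16.899100000000004` seen above.

```python
import math

def times_plus_one(numA, numB):
    return numA * numB + 1

# Each case is (numA, numB, expected result). Values chosen for illustration.
edge_cases = [
    (0, 5, 1),         # a zero wipes out the product
    (-3, 4, -11),      # one negative input
    (-2, -2, 5),       # two negative inputs
    (0.1, 0.2, 1.02),  # decimals: the result may not be exact
]

for a, b, expected in edge_cases:
    result = times_plus_one(a, b)
    # math.isclose tolerates tiny floating-point rounding differences.
    print(a, b, math.isclose(result, expected))

# Non-numerical input: a string times an int repeats the string,
# and adding 1 to that string then raises a TypeError.
try:
    times_plus_one("2", 2)
except TypeError as err:
    print("TypeError:", err)
```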


## START HERE ##

In [5]:
# Load/preview data
df = pd.read_csv("LabourTrainingEvaluationData.csv")
df.head()

Unnamed: 0,Age,Eduacation,Race,Hisp,MaritalStatus,Nodeg,Earnings_1974,Earnings_1975,Earnings_1978
0,45,LessThanHighSchool,NotBlack,NotHispanic,Married,1,21516.67,25243.55,25564.67
1,21,Intermediate,NotBlack,NotHispanic,NotMarried,0,3175.971,5852.565,13496.08
2,38,HighSchool,NotBlack,NotHispanic,Married,0,23039.02,25130.76,25564.67
3,48,LessThanHighSchool,NotBlack,NotHispanic,Married,1,24994.37,25243.55,25564.67
4,18,LessThanHighSchool,NotBlack,NotHispanic,Married,1,1669.295,10727.61,9860.869


### Part 1 - Age

<ol>
<li> Make and plot a Hist and Pmf for age.
<li> What fraction of the people in the data are 51? What fraction are older than 51?
<li> What is the median age? 
<li> Does the distribution of the sample data seem to mirror the working age population?
</ol>
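As a rough illustration of what a Hist and a Pmf contain, the sketch below builds both with plain Python rather than `thinkstats2`. The ages are made up and are not the assignment data, and the variable names are only for this example. A Hist maps each value to how often it occurs; a Pmf divides those counts by the total so the values sum to 1.

```python
from collections import Counter
from statistics import median

# Made-up ages, NOT the assignment data.
ages = [18, 21, 21, 30, 30, 30, 45, 51]

# Hist: how many times each value occurs.
hist = Counter(ages)

# Pmf: each count divided by the total number of people.
n = len(ages)
pmf = {age: count / n for age, count in hist.items()}

print(hist[30])  # number of people who are exactly 30
print(pmf[30])   # fraction of people who are exactly 30
# Fraction of people older than 30: add up the Pmf above that age.
print(sum(p for age, p in pmf.items() if age > 30))
print(median(ages))
```

With the real data, the same quantities can be read from the Hist and Pmf objects you plot.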


#### Part 1 - Answers

In [6]:
# 1.1 Plot the HIST and PMF for AGE

In [7]:
# 1.2 Create functions to answer points 2 and 3

# For this function, the parameter "older" should control if the 
# function returns the people who are 51 (False), or the people who are
# older than 51 (True). Here it is defaulted to False. 
def fraction51(older=False):
    return 0

def medianAge():
    return 0

##### Point 4 - Place Answer Here

### Part 2 - Demographics

<ol>
<li>Consider some of the demographic features: 
    <ol>
    <li>Education
    <li>Race
    <li>Hisp
    <li>MaritalStatus
    <li>Nodeg
    </ol>
<li>This data came from the United States, does it appear to be a representative sample of the US population?*
<li>Demonstrate this in some way in your code. 
</ol>

Note: you do not need to do deep research on the demographics of the USA to answer this. A high-level judgement is fine.

##### Point 2 - Place Answer Here

In [8]:
# Point 3 - Illustrative Code Here

### Part 3 - Earnings

<ol>
<li>Make and plot a graph or graphs of your choosing of the 3 earnings values, in order to answer the question below. Identify how the graph gave you your answer.
<li>What is one conclusion you could draw from visualizing the earnings in the different years? Please express it in plain language, in non-statistical terms, as though you were explaining to one of your friends what happened to earnings between 1974 and 1978.
<li>Which has the greatest effect size on 1978 earnings: Race, Hispanic, MaritalStatus, or Nodeg? 
<li>What could you investigate further in an attempt to explain this?
<li>Plot a histogram and PMF, and compute useful descriptive statistics (think: average...) for the 1978 earnings value. Use the "Cohorts" code from the quiz to break the data into cohorts, plotting each group (either on one chart or separately, whichever makes the most sense for examining the data). State specifically why you chose one chart vs. many.
<li>What is the difference in median income between the groups?
<li>Theorize a reason for the difference between the groups that could be explained with more data. Briefly describe the data you'd need. This does not need to be something you have data for, or know how to solve right now - just one well founded hypothesis on what might explain the difference.
<li>Are there outliers in the 1978 earnings data? Demonstrate this in some way with your code. 
<li>What can you infer from the presence of outliers that may impact analysis of this data?
</ol>
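Point 3 asks about effect size. Assuming this means Cohen's d (the difference between two group means divided by a pooled standard deviation), a minimal sketch on made-up numbers could look like the following. The function name `cohen_d` and the sample values are illustrative only. This version pools the population variances weighted by group size; other conventions exist and give slightly different numbers.

```python
from math import sqrt
from statistics import mean, pvariance

def cohen_d(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pool the two variances, weighting each by its group size.
    pooled_var = (n1 * pvariance(group1) + n2 * pvariance(group2)) / (n1 + n2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Made-up earnings for two hypothetical groups, NOT the assignment data.
group_a = [10, 12, 14, 16, 18]
group_b = [8, 10, 12, 14, 16]
print(cohen_d(group_a, group_b))
```

Because d is expressed in units of standard deviation, it can be compared across features such as Race, Hisp, MaritalStatus, and Nodeg even though the groups differ in size.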

In [9]:
# Point 1 - Graph of earnings

##### Point 2 - Place Answer Here

In [10]:
# Point 3 - Largest Impact
def largestImpact():
    return 0

##### Point 4 - Place Answer Here

In [11]:
# Point 5 - Chart

In [12]:
# Point 6 - difference in medians
def differenceMedians():
    return 0

##### Point 7 - Place Answer Here

In [13]:
# Point 8 - outliers

##### Point 9 - Place Answer Here