**Description** - This assignment is intended to teach you about data processing using Python, reporting descriptive statistics on a variety of data types, and thinking critically about common data processing challenges.

**Getting Started** - You should complete the assignment using your own installation of Python 3 and the packages numpy and pandas. Download the assignment from Moodle and unzip the file. This will create a directory with this file, 'HW01.ipynb', and a 'data' directory. The data files for each data set are in the 'data' directory.

**Deliverables** - The assignment has two deliverables: a report and a Jupyter notebook code file. You will submit both of these files together in a single compressed folder on Moodle.

- **Report** - The solution report will give your answers to the homework questions (listed below). You can use any software to create your report, but your report _must_ be submitted as a .pdf file.
- **Code** - To submit the coding component of this assignment, first complete every coding problem in this document by writing code in the sections labeled --- write code here ---. After completing the coding problems, save a copy of this file and submit it along with your report. The code should be clean and organized, and the instructor and the TA should be able to run it by executing each code cell in order from top to bottom. It is strongly suggested that you reset the notebook and execute every cell in order before submitting.

**Data Sets** - In this assignment, you will conduct a descriptive statistical analysis on 5 data files in the 'data' folder. The first dataset, 'census.csv', contains anonymized US census records. The remaining datasets contain samples from common univariate distributions.

**Academic Honesty Statement** - Copying solutions from external sources (books, web pages, etc.) or other students is considered cheating. Sharing your solutions with other students is considered cheating. Posting your code to public repositories such as GitHub is also considered cheating. Any detected cheating will result in a grade of 0 on the assignment for all students involved, and potentially a grade of F in the course.

This academic honesty statement does not restrict you from reading official documentation or using other web resources to understand the syntax of Python, related data science libraries, or properties of distributions.

In [2]:
# Do not import any libraries other than those listed here.

import numpy as np
import pandas as pd

# Problem 1 - US Census Data

In this problem you'll analyze a sample of publicly released US Census data.

**Part 1** (3 points)  
Load the census.csv file into a pandas dataframe object. Once loaded, print the first 5 rows of the dataframe. See https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.html for documentation on pandas dataframes. 

In [3]:
# Part 1 Solution

# --- write code here ---
#Robert Rossetti
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('data/census.csv')
# Show the first 5 rows
df.head(5)






| Unnamed: 0 | Age | Workclass | Education | Marital Status | Occupation | Relationship | Capital-Gain | Capital-Loss | Hours-Per-Week | Country-of-Origin | Annual Income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | 0 | 0 | 40 | Cuba | <=50K |


**Part 2** (11 points)  
For each variable (column), describe its statistical type from the following list:  

- Nominal
- Ordinal  
- Interval 
- Ratio scale

Justify your decision in a single sentence for each variable.
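
One way to gather evidence for these judgments is to look at the distinct values each column takes. The following is a minimal sketch, assuming a small hand-built dataframe (`sample`, a hypothetical name) that copies a few columns from the five rows printed in Part 1 rather than reading the full file.

```python
import pandas as pd

# Hypothetical sample: a few columns from the first five rows shown in Part 1.
sample = pd.DataFrame({
    "Age": [39, 50, 38, 53, 28],
    "Education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "Hours-Per-Week": [40, 13, 40, 40, 40],
    "Annual Income": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

# Count the distinct values in each column; a small set of text labels
# hints at a categorical (nominal or ordinal) variable.
distinct_counts = sample.nunique()
print(distinct_counts)

# List the labels of one column to judge whether they have a natural order.
education_levels = sorted(sample["Education"].unique())
print(education_levels)
```

Counting and listing values only narrows the choice. Whether the labels have a meaningful order or a true zero still has to be argued from what the variable measures.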

**Part 3** (11 points)  
How might pandas datatypes such as int, float, string, etc. help determine a variable's statistical type? Are there any circumstances where using pandas datatypes would be misleading?
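
As a rough illustration of the question, the sketch below inspects the dtypes of a small dataframe. The dataframe `example` and its 'Zip-Code' column are hypothetical and are not part of census.csv; they are included only to show a case where a numeric dtype says nothing about statistical type.

```python
import pandas as pd

# Hypothetical columns chosen to illustrate the point; not taken from census.csv.
example = pd.DataFrame({
    "Age": [39, 50, 38],                             # ratio scale, stored as integers
    "Zip-Code": [10001, 94105, 60601],               # nominal, but also stored as integers
    "Education": ["Bachelors", "HS-grad", "11th"],   # ordinal, stored as text
})

print(example.dtypes)

# A numeric dtype only says arithmetic is possible, not that it is meaningful.
numeric_columns = example.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Here both 'Age' and 'Zip-Code' are reported as numeric, although averaging zip codes has no meaning, and the text dtype of 'Education' hides the fact that its labels can be ordered.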

**Part 4** (24 points)  
Add a new column to the dataframe, 'College-Degree'. For each person in the census, assign a value of 1 to this column if they have completed an Associates, Bachelors, Masters, or Doctorate degree and a value of 0 otherwise. 

Is the average income for a person with a college degree greater than the average income for a person without a college degree? Is it possible to answer this question given the available data? If not, propose a variant of the question that can be answered and answer it.

Hint: You may find it helpful to convert the 'Annual Income' column into numeric values. String matching in pandas is case sensitive, and can easily result in errors due to unanticipated string characters such as hyphens, underscores, and spaces.
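
The pitfall described in the hint can be sketched as follows. The series `income` is a hypothetical set of raw values with stray whitespace, not data read from the file, and the 0/1 encoding is one possible choice of numeric conversion.

```python
import pandas as pd

# Hypothetical raw values with the stray whitespace the hint warns about.
income = pd.Series([" <=50K", ">50K ", " >50K", "<=50K"])

# An exact comparison silently misses padded strings.
naive_match = (income == ">50K").sum()

# Stripping first, then comparing, catches every '>50K' entry.
cleaned = income.str.strip()
high_income = (cleaned == ">50K").astype(int)

print(naive_match)            # number of matches found without cleaning
print(high_income.tolist())   # 1 where income is above 50K, else 0
```

With a 0/1 encoding like this, the mean of the column is the proportion of people in the higher income bracket.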

In [5]:
# Part 4 Solution

# --- write code here ---
# Strip whitespace before matching, since string comparison is exact.
degrees = ["Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Doctorate"]
has_degree = df["Education"].str.strip().isin(degrees)

# Insert the column only once so the cell can be re-run safely,
# then store 1 for degree holders and 0 otherwise (as integers, not strings).
if "College-Degree" not in df.columns:
    df.insert(3, "College-Degree", 0)
df["College-Degree"] = has_degree.astype(int)

# Only income brackets are recorded, so compare the share of each group making >50k.
high_income = df["Annual Income"].str.strip().str.startswith(">")
cdtotal = has_degree.sum()
nocdtotal = (~has_degree).sum()
cdMoreThan50k = (high_income & has_degree).sum()
nocdMoreThan50k = (high_income & ~has_degree).sum()
cd50kp = cdMoreThan50k / cdtotal
nocd50kp = nocdMoreThan50k / nocdtotal
print("Percentage of people with a college degree who make >50k= "+str(cd50kp*100)+"%")
print("Percentage of people with NO college degree who make >50k= "+str(nocd50kp*100)+"%")

Percentage of people with a college degree who make >50k= 41.3682092555332%
Percentage of people with NO college degree who make >50k= 16.484682374784494%


**Part 5** (24 points)  
Like many real-world datasets, the census dataset contains several missing records. For this assignment we'll focus on missing records for the 'Occupation' and 'Country-of-Origin' variables.

Construct a new dataframe which is a copy of the census dataframe, removing any rows where the 'Occupation' variable is missing. Compare the average 'Capital-Gain' and 'Capital-Loss' between the original and the filtered data frames. Repeat this procedure, instead removing any rows where the 'Country-of-Origin' variable is missing.

How do our estimates of average 'Capital-Gain' and 'Capital-Loss' change when we omit the rows with missing 'Occupation' and 'Country-of-Origin' variables? Provide a plausible explanation for why filtering using different variables produces different results. Is this approach to handling missing data justified? Why or why not? If not, describe an alternative procedure for handling missing data.
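
A minimal sketch of the filtering step, and of one check on whether dropping rows is safe, is shown below. The dataframe `toy` is hypothetical, and it assumes that missing values are coded as '?' (possibly padded with whitespace), as in the solution code that follows.

```python
import pandas as pd

# Hypothetical toy data; '?' is assumed to mark a missing value.
toy = pd.DataFrame({
    "Occupation": ["Sales", " ?", "Tech-support", "?", "Sales"],
    "Capital-Gain": [1000, 0, 3000, 0, 2000],
})

# Normalize whitespace, then turn the '?' placeholder into a real missing value.
occupation = toy["Occupation"].str.strip()
toy["Occupation"] = occupation.mask(occupation == "?")

# Drop rows that lack an occupation and compare the two averages.
filtered = toy.dropna(subset=["Occupation"])
mean_before = toy["Capital-Gain"].mean()
mean_after = filtered["Capital-Gain"].mean()
print(mean_before, mean_after)

# Compare rows with and without the value: a large gap suggests that
# missingness is related to the quantity being averaged.
missing = toy["Occupation"].isna()
mean_missing = toy.loc[missing, "Capital-Gain"].mean()
mean_present = toy.loc[~missing, "Capital-Gain"].mean()
print(mean_missing, mean_present)
```

In this toy example the rows with a missing occupation all have zero capital gain, so dropping them raises the average. When the two groups differ like this, an alternative such as keeping the rows and treating "missing" as its own category avoids biasing the estimate.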

In [None]:
# Part 5 Solution

# --- write code here ---
def filterMissingData(data,colName):
    """Return a copy of data without the rows whose colName value is '?'."""
    # Missing values are coded as '?', possibly padded with whitespace.
    missing=data[colName].str.strip()=='?'
    filtered=data[~missing].copy()
    print("Number of rows dropped="+str(data.shape[0]-filtered.shape[0]))
    return filtered

df2=filterMissingData(df,"Occupation") #rows with a known occupation
df3=filterMissingData(df,"Country-of-Origin") #rows with a known country of origin

def computeGainLossAverage(data):
    """Print the average capital gain and loss without modifying data."""
    total=data.shape[0]
    print("\tAverage capital gain="+str(data["Capital-Gain"].sum()/total))
    print("\tAverage capital loss="+str(data["Capital-Loss"].sum()/total))

print("Average capital gain/loss BEFORE filtering:")
computeGainLossAverage(df)
print("Average capital gain/loss AFTER filtering rows with no occupation:")
computeGainLossAverage(df2)
print("Average capital gain/loss AFTER filtering rows with no country of origin:")
computeGainLossAverage(df3)

Number of rows dropped=1843
Number of rows dropped=583
Average capital gain/loss BEFORE filtering:
	Average capital gain=1077.6488437087312
	Average capital loss=87.303829734959
Average capital gain/loss AFTER filtering rows with no occupation:
	Average capital gain=1106.0370792369295
	Average capital loss=88.91021550882219
Average capital gain/loss AFTER filtering rows with no country of origin:
	Average capital gain=1064.3606229282632
	Average capital loss=86.73935205453749

# Problem 2 - Synthetic Data

In this problem you'll be asked to compare samples from the following 4 distributions.

 - Poisson
 - Uniform
 - Normal
 - Truncated Normal

These datasets of samples have been randomly ordered, and are named 'synthetic1.csv', 'synthetic2.csv', etc. in the 'data' directory.

**Part 1** (3 points)    
Load each of the 4 datasets as numpy arrays. See https://docs.scipy.org/doc/numpy/user/index.html for documentation on numpy arrays.

In [6]:
# Part 1 Solution

# --- write code here ---
s1 = np.genfromtxt('data/synthetic1.csv', delimiter=",")
s2 = np.genfromtxt('data/synthetic2.csv', delimiter=",")
s3 = np.genfromtxt('data/synthetic3.csv', delimiter=",")
s4 = np.genfromtxt('data/synthetic4.csv', delimiter=",")



**Part 2**  (24 points)    
Match each dataset of samples to the distribution that was used to generate it **without using any visualization tools such as matplotlib**. Visualization is an important method in data science, but will be covered in future homework assignments. 

Describe your findings in the report, including a 1-2 sentence justification for each dataset.
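
Numeric summaries alone can separate these families. The sketch below is one possible approach, not the required method: it draws hypothetical stand-in samples with numpy's random generator (the real data come from the synthetic CSV files) and computes a few clues for each. The function name `summarize` and the chosen parameters are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-in samples; the real data come from the synthetic CSV files.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, 100_000)
uniform = rng.uniform(-3.0, 3.0, 100_000)
poisson = rng.poisson(1.0, 100_000).astype(float)

def summarize(sample):
    """Return numeric clues that help tell distributions apart."""
    centered = sample - sample.mean()
    variance = np.mean(centered ** 2)
    return {
        # Poisson samples are counts, so every value is a whole number.
        "all_integers": bool(np.all(sample == np.round(sample))),
        "min": sample.min(),
        "max": sample.max(),
        "mean": sample.mean(),
        "variance": variance,
        # Excess kurtosis: about 0 for a normal, about -1.2 for a uniform.
        "excess_kurtosis": np.mean(centered ** 4) / variance ** 2 - 3.0,
    }

for name, sample in [("normal", normal), ("uniform", uniform), ("poisson", poisson)]:
    print(name, summarize(sample))
```

Useful reference facts: a Poisson distribution has equal mean and variance, and a uniform distribution on an interval of width w has variance w²/12. A truncated normal has hard minimum and maximum bounds like a uniform, but its excess kurtosis would typically fall between the uniform and normal values.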

In [10]:
# Part 2 Solution

# --- write code here ---
import math
def stats(s):
    #track the range and accumulate the sum in a single pass
    mini=float("inf")
    maxi=float("-inf")
    mean=0
    for x in s:
        mean+=x
        if(x>maxi):
            maxi=x
        if(x<mini):
            mini=x
    n=len(s)
    mean=mean/n
    #population variance: mean squared deviation from the mean
    sumsq=0
    for x in s:
        sumsq+=math.pow(x-mean,2)
    var=sumsq/n
    #standard deviation is the square root of the variance
    dev=math.sqrt(var)
    print("\tmin={}, max={}".format(mini,maxi))
    print("\tmean="+str(mean))
    #rounded to four decimals for readability
    print("\tstandard deviation={:.4f}".format(dev))
    print("\tvar="+str(var))
    return 

print("Synthetic 1: ")
stats(s1)
print("Synthetic 2: ")
stats(s2)
print("Synthetic 3: ")
stats(s3)
print("Synthetic 4: ")
stats(s4)

Synthetic 1: 
	min=-4.45563583217179, max=4.538328377489084
	mean=-0.001036969451549452
	standard deviation=1.0066
	var=1.0132747384346845
Synthetic 2: 
	min=-2.9944118022511796, max=2.998579407072315
	mean=0.0028225889165691947
	standard deviation=0.9824
	var=0.9651057873772847
Synthetic 3: 
	min=0.0, max=8.0
	mean=1.00173
	standard deviation=0.9983
	var=0.9965670071007573
Synthetic 4: 
	min=-2.9999504172128457, max=2.999928696273451
	mean=0.003365529982741416
	standard deviation=1.7303
	var=2.9940735522192328
	var=2.9940735522192328
