# Homework 1

## ECE 204 Data Science & Engineering

Consider the two CSV files named `a.csv` and `b.csv`. CSV (Comma-Separated Values) files store tabular data as plain text. Columns are separated by a delimiter (usually a comma), and each row is on its own line. For example, if a file contained:
```
label,foo,bar
a,1.1,2.4
b,2.6,1.5
```
Then it would encode the table:

| label  | foo | bar |
|--------|-----|-----|
|  a     | 1.1 | 2.4 |
|  b     | 2.6 | 1.5 |

Programs such as Excel can parse CSV files automatically and import them into spreadsheets. Pandas can also import a CSV file directly as a DataFrame, using the function `pd.read_csv`. Run the following code cell to import the two CSV files into dataframes named `df1` and `df2`.

In [7]:
import pandas as pd

df1 = pd.read_csv('a.csv')
df2 = pd.read_csv('b.csv')
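
As a quick check on the CSV format described above, the three-line sample can be parsed without creating a file. This is a minimal sketch: `sample_csv` and `sample_df` are illustrative names that are not part of the assignment, and `io.StringIO` is used only to stand in for a file on disk.

```python
import io

import pandas as pd

# The same text as the sample CSV shown earlier, held in a string.
sample_csv = "label,foo,bar\na,1.1,2.4\nb,2.6,1.5\n"

# read_csv accepts any file-like object, so StringIO stands in for a file.
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df)         # two rows, three columns
print(sample_df.dtypes)  # label is text, foo and bar are floats
```

The first line of the text becomes the column names, and each later line becomes one row.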

Each file is a table of people, with one person per row and one attribute per column. The attributes include address, age, net worth, and an essay (all of the data are fabricated, of course). Use the `head` method to examine the contents of these dataframes by running `df1.head()` and `df2.head()` in the next cells.

In [9]:
df1.head()

Unnamed: 0,name,age,address,essay
0,Ryan Gallagher,75,"6317 Mary Light Smithview, HI 13900",Against power across. Rather why rise month sh...
1,Theresa Brown,20,"449 Austin Rapid Suite 685 Seanburgh, AK 61435",Interest this clearly concern discover compute...
2,Brian Foster,61,"4038 Hill Drive East Sarafort, SC 25854",Face test which summer head. Front hold eat ea...
3,Natalie Pope,87,USNV Nielsen FPO AE 61531,Too both light. Herself bill economic room imp...
4,Samantha Washington,44,"7550 Laurie Row South Evelyn, NV 96037",Rate such friend behavior song source knowledg...


In [10]:
df2.head()

Unnamed: 0,name,net_worth
0,Ryan Gallagher,217.02
1,Theresa Brown,235.99
2,Joshua Wood,350.8
3,Natalie Pope,21.99
4,Samantha Washington,692.86


---
**Problem 1.** The two files contain records on some of the same people, but there are people in `df1` that are not in `df2` (and vice versa). Your first task is to create a new dataframe called `df` that contains (1) the columns from both `df1` and `df2`, but (2) only the names that belong to both `df1` and `df2`. **Hint:** what we're doing here is called an "inner join"; it takes the intersection of values in the "name" column, which is shared in both DataFrames. Take a look at the function [pd.merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html). There is also more information on [DataFrame joining and merging](https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging).

Find the number of names in your newly created `df`.



In [12]:
df = pd.merge(df1, df2, on='name', how='inner')
df

Unnamed: 0,name,age,address,essay,net_worth
0,Ryan Gallagher,75,"6317 Mary Light Smithview, HI 13900",Against power across. Rather why rise month sh...,217.02
1,Theresa Brown,20,"449 Austin Rapid Suite 685 Seanburgh, AK 61435",Interest this clearly concern discover compute...,235.99
2,Natalie Pope,87,USNV Nielsen FPO AE 61531,Too both light. Herself bill economic room imp...,21.99
3,Samantha Washington,44,"7550 Laurie Row South Evelyn, NV 96037",Rate such friend behavior song source knowledg...,692.86
4,James Taylor,58,"9158 Graves Route Port Stephanieview, AZ 17858",As read smile thought try day difference socie...,-189.63
...,...,...,...,...,...
313,Jennifer Silva,44,"4836 Sandra Turnpike Suite 045 Natalieton, MN ...",Moment red attention bill expert play quite. D...,-132.51
314,Michael Reyes,86,"045 Deanna Trail Port Deanna, RI 84675",Every stage police across include mention. Rec...,524.56
315,Joshua Mccarthy,50,"15607 Betty Passage Taraville, VA 63693",Born her arm chair. Total necessary a return o...,122.44
316,David Perry,95,45395 Kirk Shoal Suite 264 East Stephenborough...,North civil style challenge. Memory fill accor...,234.99
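
To make the inner join and the counting step concrete, here is a small sketch on made-up data. The frames `left` and `right` and their contents are hypothetical and unrelated to `a.csv` and `b.csv`.

```python
import pandas as pd

# Two toy tables that share a "name" column but only partly overlap.
left = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]})
right = pd.DataFrame({"name": ["Bob", "Cy", "Dee"], "net_worth": [10.5, -3.0, 99.9]})

# An inner join keeps only the names present in both tables.
both = pd.merge(left, right, on="name", how="inner")

n_rows = len(both)                        # number of rows after the join
n_unique_names = both["name"].nunique()   # number of distinct names

print(both)
print(n_rows, n_unique_names)
```

If a name appeared more than once in either table, `len` and `nunique` could disagree, so it can be worth checking both.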


---
**Problem 2.** What is the full name of the person with the highest net worth in `df2`? For this problem, break it up into two steps: First consider the `net_worth` column and find the index of its largest value (look at the method `pd.Series.idxmax`). Then, use the index you found to extract the required information.

In [19]:
# idxmax returns the index *label* of the largest value, so look the row up by label with .loc
max_id = df2["net_worth"].idxmax()
df2.loc[max_id]

name         Veronica Bentley
net_worth              797.27
Name: 290, dtype: object
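
`idxmax` returns an index label rather than a row position. With the default index (0, 1, 2, ...) the two coincide, but in general they do not. The sketch below uses a hypothetical Series named `worth` with a non-default index to show the difference.

```python
import pandas as pd

# A toy Series whose index labels are not 0, 1, 2.
worth = pd.Series([5.0, 42.0, 17.0], index=[10, 20, 30])

label = worth.idxmax()        # index label of the largest value
by_label = worth.loc[label]   # label-based lookup

position = int(worth.to_numpy().argmax())  # integer position of the largest value
by_position = worth.iloc[position]         # position-based lookup

print(label, position)        # the label and the position differ here
print(by_label, by_position)  # both lookups reach the same value

# Passing the label to .iloc here would raise an IndexError,
# because there is no row at position 20.
```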

---
**Problem 3.** What is the full address of the person that wrote the shortest essay in `df1`?

**Hint 1:** If you want to apply a function such as `len()` to an entire column of a dataframe 
(apply `len` to each item in the column), take a look at the method `pd.Series.apply`.

**Hint 2:** If you used `pd.Series.idxmax` to find the index associated with the maximum value of a Series, which method should you use to find the index of the minimum value?

In [52]:
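# Length of each essay, in characters
df1["essay_length"] = df1["essay"].apply(len)

# Index label of the row with the shortest essay
min_length_index = df1["essay_length"].idxmin()

# Address of the person in that row
shortest_person_address = df1.loc[min_length_index, "address"]
shortest_person_address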

'177 Jeffrey Forge Paigestad, MA 51130'
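
As an aside, Pandas also offers the vectorized string accessor `Series.str.len()`, which should give the same lengths as `apply(len)` for a column of strings. The sketch below uses a hypothetical Series named `essays`; it is an alternative for comparison, not the method the hints ask for.

```python
import pandas as pd

# Toy essays of clearly different lengths.
essays = pd.Series(["one two three", "hi", "a bit longer text"])

via_apply = essays.apply(len)  # call len() on each string
via_str = essays.str.len()     # vectorized string accessor

shortest = via_str.idxmin()    # index label of the shortest essay

print(list(via_apply))
print(list(via_str))
print(shortest, essays.loc[shortest])
```

One practical difference: `str.len()` returns a missing value for a missing entry, whereas `apply(len)` raises a `TypeError` when it meets a `NaN`.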

---
**Problem 4.** What is the sum of all the 5-digit ZIP codes at the end of every address in `df1`? For example, if two addresses are "Unit 8092 Box 6526 DPO AE 62605" and "1048 Bryant Ports Lake Victoria, WY 96197", the two 5-digit codes at the ends of the addresses are 62605 and 96197, and their sum is 62605 + 96197 = 158802. **Hint:** Consider creating a custom function that extracts the ZIP code from a single string, and then use the `apply` method to apply your function to each address.

In [70]:
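# Take the last five characters of each address as the ZIP code.
# The name `zip_codes` avoids shadowing Python's built-in `zip`.
zip_codes = df1["address"].apply(lambda x: x[-5:]).apply(int)
zip_codes.sum()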

31727403
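
The hint suggests a custom function. One possible sketch is shown below, applied to the two example addresses from the problem statement. The function name `extract_zip` and the regular expression are assumptions made for illustration; the approach assumes every address ends in exactly five digits, optionally followed by whitespace.

```python
import re

import pandas as pd


def extract_zip(address: str) -> int:
    """Return the 5-digit ZIP code at the end of an address as an integer."""
    match = re.search(r"(\d{5})\s*$", address)
    if match is None:
        # Fail loudly rather than silently summing a wrong number.
        raise ValueError(f"no trailing 5-digit ZIP code in: {address!r}")
    return int(match.group(1))


# The two example addresses given in the problem statement.
addresses = pd.Series([
    "Unit 8092 Box 6526 DPO AE 62605",
    "1048 Bryant Ports Lake Victoria, WY 96197",
])

total = addresses.apply(extract_zip).sum()
print(total)  # 158802, matching the worked example
```

Compared with slicing the last five characters, the regular expression tolerates trailing whitespace and raises an error on a malformed address.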

---
**Problem 5.** How many people in `df1` are 70 years old or older? **Hint:** A comparison such as `x < 3` typically returns a single `bool`. If `x` is instead a Pandas Series or DataFrame, the comparison is applied element-wise to every item and returns a Series or DataFrame of booleans. Try it out!

In [83]:
# Keep only the rows where age >= 70; the number of rows in the result answers the question
df1[df1["age"] >= 70]



Unnamed: 0,name,age,address,essay,essay_length
0,Ryan Gallagher,75,"6317 Mary Light Smithview, HI 13900",Against power across. Rather why rise month sh...,198
3,Natalie Pope,87,USNV Nielsen FPO AE 61531,Too both light. Herself bill economic room imp...,190
5,Chris Curtis,80,"0830 Robert Forest Suite 091 Mccoystad, NH 07000",News card industry. Brother final staff Congre...,161
6,Victor Martinez,95,"83607 Peter Parkway Powellport, OH 68681",Top here box election yard as per. Blue around...,168
10,Rachel Meyer,78,"492 Rodriguez Lake Davenportmouth, AK 81451",Turn box lay tend. Sort increase between road....,152
...,...,...,...,...,...
606,Ashley Smith,74,"67692 Brandi Path Suite 047 East Carrie, MO 91316",Key southern partner develop could air recentl...,177
607,Kelsey Cruz,75,"90930 Baker Ports Apt. 164 Jeremymouth, MI 48317",Brother billion name accept impact.\r\r\nWonde...,135
609,Michael Reyes,86,"045 Deanna Trail Port Deanna, RI 84675",Every stage police across include mention. Rec...,143
613,David Perry,95,45395 Kirk Shoal Suite 264 East Stephenborough...,North civil style challenge. Memory fill accor...,164
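
The filtered table lists the matching rows, but the question asks for a count. One way to get it, sketched below, is to sum the boolean mask, since `True` counts as 1 and `False` as 0. The Series `ages` is a stand-in holding the five ages shown in the `df1.head()` output above, not the full column.

```python
import pandas as pd

# The five ages visible in df1.head(), used as a stand-in for df1["age"].
ages = pd.Series([75, 20, 61, 87, 44])

mask = ages >= 70        # element-wise comparison gives a boolean Series
count = int(mask.sum())  # True counts as 1, False as 0

print(list(mask))
print(count)
```

The same count could be obtained by taking `len` of the filtered rows.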
