Adapted from a tutorial by [Shane Lynn](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/)

# THE ROLODEX

![rolodex](https://nibsblog.files.wordpress.com/2012/04/mgmrolodexfileebay-opt.jpg)

What's interesting about the ```uk-500.csv``` file is that none of it is real, yet it is realistic: the records start from real person and company names, which are then scrambled.  The street addresses are not real, and neither are the websites or email addresses.

[Here's a source](https://www.briandunning.com/sample-data/) for more data like this.

Data anonymization is a software development task in its own right.  Realism is often the goal, so hand-entering a few "phony records" is not enough.

Keeping the medical records themselves real while replacing patient information with simulated data allows a program to keep using the same data structures, including for medical research purposes.
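
As a rough illustration of that idea, the sketch below leaves the clinical columns of a toy table untouched and overwrites only the identifying columns with simulated values.  The column names, the name pools, and the `anonymize` helper are all hypothetical, and real anonymization needs far more care than this.

```python
import numpy as np
import pandas as pd

# Toy records; every value here is made up for illustration.
records = pd.DataFrame({
    "patient_name": ["Real Person A", "Real Person B", "Real Person C"],
    "email": ["a@real.example", "b@real.example", "c@real.example"],
    "diagnosis_code": ["J45", "E11", "I10"],
    "systolic_bp": [118, 135, 150],
})

# Hypothetical pools of replacement names.
FIRST = ["Alex", "Sam", "Jo", "Pat"]
LAST = ["Doe", "Roe", "Poe", "Loe"]

def anonymize(df, seed=0):
    """Return a copy with the identifying columns replaced by simulated values."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    first = rng.choice(FIRST, size=len(out))
    last = rng.choice(LAST, size=len(out))
    out["patient_name"] = [f"{f} {l}" for f, l in zip(first, last)]
    # The row number keeps simulated addresses distinct even if names repeat.
    out["email"] = [
        f"{f}.{l}{i}@example.com".lower()
        for i, (f, l) in enumerate(zip(first, last))
    ]
    return out

anonymized = anonymize(records)
print(anonymized)
```

The table keeps its shape and column names, so code written against the real records should run unchanged against the anonymized copy.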

### LAB CHALLENGE

Download the ```us-500.csv``` file from the above link.  Or perhaps you can find it on GitHub and read it in over the web.

### LAB CHALLENGE

Swap the above rolodex picture for another one.  Use a URL.

In [None]:
import pandas as pd
import numpy as np
 
# read the data from the CSV file hosted online.
data = pd.read_csv('https://s3-eu-west-1.amazonaws.com/shanebucket/downloads/uk-500.csv')

# set a numeric id for use as an index for examples.
data['id'] = np.arange(data.shape[0], dtype=int)

data.head(5)

In [None]:
# Single selections using iloc and DataFrame
# Rows:
#data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
#data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
# Columns:
#data.iloc[:,0] # first column of data frame (first_name)
#data.iloc[:,1] # second column of data frame (last_name)
#data.iloc[:,-1] # last column of data frame (id)

In [None]:
data.set_index("last_name", inplace=True)
data.head()

In [None]:
data.loc[['Andrade', 'Veness'], 'city':'email']

In [None]:
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]

### LAB CHALLENGE

Can the above be right?  We ask for all rows between an A and a V name and get only three total?  The table is not sorted.  Last names are in random order.  How might we "fix" that?

[Here's a hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)

After sorting by ```last_name``` I get a count of 455 records between these two names inclusive.  Do you get that too?
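
To see why sorting changes the answer, here is a small sketch on a made-up five-row table (the names are invented for illustration, not read from the CSV).  It assumes `last_name` is the index, so it uses `sort_index`; the column-based equivalent is `sort_values`.

```python
import pandas as pd

# Toy table with last names deliberately out of alphabetical order.
toy = pd.DataFrame(
    {"first_name": ["Zed", "Al", "Mo", "Vi", "Bo"]},
    index=pd.Index(
        ["Zuniga", "Andrade", "Miller", "Veness", "Baker"], name="last_name"
    ),
)

# Unsorted index: the slice follows row order, from the 'Andrade' row
# through the 'Veness' row, so 'Baker' (stored further down) is missed.
unsorted_rows = toy.loc["Andrade":"Veness"]

# Sorted index: the slice covers every label that falls alphabetically
# between the two endpoints, inclusive.
sorted_rows = toy.sort_index().loc["Andrade":"Veness"]

print(len(unsorted_rows), len(sorted_rows))
```

The unsorted slice returns three rows and the sorted slice returns four, because only the sorted version picks up the name that sits alphabetically between the endpoints but was stored elsewhere in the table.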

In [None]:
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']].count()

In [None]:
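# move last_name back into a column first, otherwise set_index discards it
data.reset_index(inplace=True)
data.set_index('id', inplace=True)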

In [None]:
data.loc[102]

In [None]:
data.loc[data['first_name'] == 'Antonio', 'city':'email']

In [None]:
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]  # checking against a list

In [None]:
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')] 

In [None]:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4) # company name is 4 words
data.loc[idx, ['email', 'first_name', 'company_name']].head()

## GROUPBY

This data set looks ideal for practicing groupby. How many contacts do I have by county?
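
As a warm-up on made-up data (the rows below are placeholders, not taken from the CSV), this sketch shows the general shape of a group-and-count: `groupby` splits the rows by a key column, and `size()` reports how many rows landed in each group.

```python
import pandas as pd

# Five invented contacts spread over two counties.
toy = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cy", "Di", "Ed"],
    "county": ["Kent", "Essex", "Kent", "Kent", "Essex"],
})

# One entry per county, holding the number of rows in that group.
per_county = toy.groupby("county").size()
print(per_county)
```

`size()` counts every row in a group, while `count()` on a selected column counts only the non-null values in it.  The two agree whenever that column has no missing entries.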

In [None]:
county_tally = data.groupby("county")["county"].count()

In [None]:
county_tally.dtype

In [None]:
county_tally.max()

### LAB CHALLENGE

Which county has these 44 listings?