# Exercise 1: Introduction to Pandas

In this exercise, we will work with the pandas library, which is one of the most important Python packages for data analysis and manipulation.  
The documentation of this package can be found here: https://pandas.pydata.org/docs/

In [None]:
import pandas as pd

## Examples: The Iris Dataset

We read in the Iris dataset that we obtained from https://archive.ics.uci.edu/ml/datasets/Iris and walk through a few basic examples.

In [None]:
# read data from a file into a data frame, specify column names by hand
df = pd.read_csv("iris.data", names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"])
df

#### Accessing rows and columns

In [None]:
# columns can be accessed as attributes
df.sepal_length

In [None]:
# rows and columns can be accessed by index as well
# -> use loc to access columns by name
print(df.loc[:,"sepal_length"])

In [None]:
# -> use iloc to access columns by numerical index
print(df.iloc[:,3])
print(df.iloc[4:,3])
print(df.iloc[4,3])
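
The difference between label-based and position-based access is easiest to see on a small frame. The following is a minimal sketch on a hypothetical toy dataframe (not the Iris file itself): a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position, as ordinary Python slices do.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Iris data
toy = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 4.6], "petal_width": [0.2, 0.2, 0.2, 0.3]},
    index=["a", "b", "c", "d"],
)

by_label = toy.loc["a":"c", "petal_width"]  # label slice: end label "c" is included
by_position = toy.iloc[0:2, 1]              # positional slice: position 2 is excluded
single_value = toy.iloc[3, 1]               # a single cell comes back as a scalar

print(len(by_label), len(by_position), single_value)
```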

#### Advanced selection and built-in functions

In [None]:
# get all rows where sepal length is bigger than 5

df.loc[df["sepal_length"]>5]
#df["sepal_length"]>5
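
The comparison inside `loc` produces a boolean Series, and `loc` keeps the rows where it is `True`. A minimal sketch on a hypothetical toy dataframe shows the mask on its own and how two conditions can be combined with `&` (each condition needs its own parentheses):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Iris data
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

mask = toy["sepal_length"] > 5   # boolean Series, one entry per row
long_sepals = toy.loc[mask]      # rows where the mask is True

# combine conditions element-wise with & (and), | (or), ~ (not)
long_virginica = toy.loc[mask & (toy["class"] == "Iris-virginica")]

print(mask.tolist())
print(len(long_sepals), len(long_virginica))
```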

In [None]:
# get mean value of sepal length
print(df.sepal_length.mean())

In [None]:
# get unique class values
print(df["class"].unique())

## Task 1: Exploring Census Data

In this task we work with the adult dataset, which has been extracted from a 1994 census dataset.  
A brief documentation can be found here: https://archive.ics.uci.edu/ml/datasets/adult

__a)__ Read in the "adult.csv" file and print its ```head()``` to get a quick overview of it. Note that this dataset contains NAs which are encoded as '?' and should be converted accordingly. How many rows and columns does this dataset have?

__b)__ Compute the mean 'working time per week'!

__c)__ Give the unique values that occur in the attribute 'education'. Further, give the number of people in the dataset that have obtained each specific education level!

__d)__ List all persons with a bachelor's degree as their highest degree, sorted by their ```capital-loss``` in descending order. What is the sum of ```capital-loss``` for these persons?

__e)__ How many males have a bachelor's degree as their highest degree?

__f)__ List the 10 youngest persons with a bachelor's degree or higher. _Hint: consider the_ ```education-num``` _attribute_.

__g)__ Show for each combination of sex and race how many instances (people) are contained in the dataset. _Hint: consider pandas'_ ```groupby()``` _function in this as well as in the following subtasks_.

__h)__ What is the mean age of men and women in this dataset?

__i)__ Show for each combination of marital-status and race how many males and females over 40 years have a bachelor's degree as their highest degree.
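
As a warm-up for the '?' encoding and the ```groupby()``` hint in Task 1, here is a minimal sketch on a hypothetical miniature table, not the real "adult.csv". The column names and values are assumptions for illustration only. It shows how `na_values` turns a placeholder string into missing values and how grouping by one or several columns yields counts and means.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for a census-style file; not the real adult.csv
raw = io.StringIO(
    "age,workclass,sex,race\n"
    "39,State-gov,Male,White\n"
    "50,?,Male,Black\n"
    "38,Private,Female,White\n"
    "28,?,Female,White\n"
    "45,Private,Female,Black\n"
)

# na_values tells read_csv which strings to treat as missing
toy = pd.read_csv(raw, na_values="?")
print(toy.shape)                      # (number of rows, number of columns)
print(toy["workclass"].isna().sum())  # number of missing entries in one column

# group sizes for every combination of two columns
pair_counts = toy.groupby(["sex", "race"]).size()
# mean of one column within each group
mean_age = toy.groupby("sex")["age"].mean()
print(pair_counts)
print(mean_age)
```

If the fields in a file carry a space after each comma, the placeholder would be read as `" ?"` rather than `"?"`; passing `skipinitialspace=True` to `read_csv` is one way to handle that case.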

## Task 2: Organizing a Book and Movie shop

For a virtual shop that sells movies and books, we have four tables:

* pd_customers: Gives the first and last name of each customer
* pd_books: Gives the raw price for all books that are being sold
* pd_movies: Gives the raw price for all movies that are being sold
* pd_transactions: Gives the list of all transactions being made (which customer bought which item)

__a)__ Load all 4 datasets in separate dataframes!

__b)__ Compile a listing of all items on sale (i.e., books and movies) in a single dataframe.
The resulting dataframe should contain two columns: ```"item_name"``` and ```"price"```. _Hint: consider pandas'_ ```concat()``` _function_.

__c)__ Join the information on customer names, transactions, and prices into a single dataframe. _Hint: consider pandas'_ ```merge()``` _function_.

__d)__ Compute a table of customers. For all customers give the number of items bought, the total price of these items, and the average price of these items.

__e)__ Round the average price to two decimal places and export the resulting table to a CSV file!

__f)__ Compute lists of the top 10 bestselling items, both by count and by sum of prices.
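
The functions hinted at in Task 2 can be tried out on small hand-made tables first. The sketch below is only an illustration under stated assumptions: all tables, the `customer_id` and `last_name` columns, and the values are hypothetical, and the real files may be structured differently. It stacks two tables with `concat()`, joins them with `merge()`, aggregates per customer, and writes the result as CSV.

```python
import io
import pandas as pd

# Hypothetical tables; the real files may use different column names
books = pd.DataFrame({"item_name": ["Dune", "Emma"], "price": [10.0, 7.5]})
movies = pd.DataFrame({"item_name": ["Alien"], "price": [11.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "last_name": ["Ito", "Roy"]})
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 1, 2], "item_name": ["Dune", "Dune", "Alien", "Emma"]}
)

# stack tables with identical columns on top of each other
items = pd.concat([books, movies], ignore_index=True)

# join on shared key columns, one merge per table
sales = transactions.merge(customers, on="customer_id").merge(items, on="item_name")

# one row per customer: number of items, total price, average price
summary = (
    sales.groupby("last_name")["price"]
    .agg(n_items="count", total="sum", average="mean")
    .round({"average": 2})
    .reset_index()
)

# most frequently sold items
top_by_count = sales["item_name"].value_counts().head(10)

# write CSV text to an in-memory buffer; a file name can be passed instead
buffer = io.StringIO()
summary.to_csv(buffer, index=False)
print(buffer.getvalue())
```

By default `merge()` performs an inner join, so transactions without a matching customer or item would be dropped; the `how` parameter changes that behaviour.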