# Assignment: Melt, Pivot, Aggregations and Iteration
### Completed by Michelle Gordon using Breast Cancer dataset

In this assignment you will experiment on your own. Using a health dataset of your choice (check with us if you are not sure), write code to demonstrate the following Pandas functions:

* Melt -DONE
* Pivot -DONE
* Aggregation -DONE
* Iteration -DONE
* Groupby -DONE

Each function demonstration will be for 30 points for a total of 150 points. Ensure that you include comments within your code and follow the rubric as a guide. Submit using your GitHub site. Ask if you have any questions.

In [10]:
import pandas as pd

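# the data file has no header row, so read it without one and name the columns manually
df = pd.read_csv("breast-cancer.data", header=None)
df.columns = ['recurrence_class', 'age', 'menopause', 'tumor-size', 'inv_nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
# preview the first five rows
df.head()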

|   | recurrence_class | age | menopause | tumor-size | inv_nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |


### Aggregation
What is the range of ages in the dataframe, and what is the average degree of malignancy of all the cases?

In [2]:
# what are the min and max ages and degree of malignancy, and what is the average degree of malignancy?
df.agg({'age': ['min', 'max'], 'deg-malig' : ['min', 'max', 'mean']})

|   | age | deg-malig |
|---|---|---|
| min | 20-29 | 1.0 |
| max | 70-79 | 3.0 |
| mean | NaN | 2.048951 |
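Because `age` is stored as text bands such as `30-39`, `min` and `max` compare the values as strings rather than as numbers. The sketch below repeats the same `agg` call on a small made-up frame (the `toy` values are illustrative, not taken from the dataset) and shows where string ordering stops matching numeric ordering:

```python
import pandas as pd

# Hypothetical toy data shaped like two of the assignment's columns.
toy = pd.DataFrame({
    "age": ["40-49", "20-29", "70-79", "30-39"],
    "deg-malig": [2, 1, 3, 2],
})

# Per-column aggregations: string min/max for age, numeric stats for deg-malig.
summary = toy.agg({"age": ["min", "max"], "deg-malig": ["min", "max", "mean"]})
print(summary)

age_min = summary.loc["min", "age"]
age_max = summary.loc["max", "age"]
malig_mean = summary.loc["mean", "deg-malig"]

# String comparison goes character by character, so "10-14" sorts before "5-9".
smallest_band = min(["5-9", "10-14"])
print(smallest_band)
```

String ordering gives the expected answer for `age` only because every band starts with two digits. A column with mixed-width bands, such as `tumor-size`, would need its bands converted to numbers before `min` and `max` are meaningful.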


### Group By
I want to see which breast regions have the highest number of cases.

In [3]:
# count the cases in each (breast, breast-quad) group; as_index=False keeps the keys as regular columns
cases_by_location = df.groupby(['breast', 'breast-quad'], as_index=False)['age'].count()
# give the columns more readable names
cases_by_location.columns = ['breast', 'breast_quadrant', '# of cases']
# sort to show the highest counts at the top
cases_by_location.sort_values('# of cases', ascending=False)

|   | breast | breast_quadrant | # of cases |
|---|---|---|---|
| 2 | left | left_low | 78 |
| 8 | right | left_up | 61 |
| 3 | left | left_up | 36 |
| 7 | right | left_low | 32 |
| 10 | right | right_up | 24 |
| 4 | left | right_low | 17 |
| 1 | left | central | 11 |
| 6 | right | central | 10 |
| 5 | left | right_up | 9 |
| 9 | right | right_low | 7 |


It looks like the left_low quadrant of the left breast (78 cases) and the left_up quadrant of the right breast (61 cases) have the most cases.
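Counting a helper column such as `age` works, but it skips rows where that column is missing. As a possible alternative, the sketch below uses `size()`, which counts every row in each group. The `toy` frame is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy records; only the two grouping columns matter here.
toy = pd.DataFrame({
    "breast": ["left", "left", "right", "left", "right"],
    "breast-quad": ["left_low", "left_low", "left_up", "left_up", "left_up"],
})

# size() counts rows per group without relying on a helper column.
counts = (
    toy.groupby(["breast", "breast-quad"])
    .size()
    .reset_index(name="# of cases")
    .sort_values("# of cases", ascending=False)
)
print(counts)
```

Passing `name=` to `reset_index` labels the count column in one step, so no separate renaming is needed.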

### Iteration
I want to see each individual record.

In [4]:
# print every record: iterrows() yields (index label, row as a Series) pairs
for index, row in df.iterrows():
    print(f"Index#: {index}\nValues:\n{row}\n")

Index#: 0
Values:
recurrence_class    no-recurrence-events
age                                30-39
menopause                        premeno
tumor-size                         30-34
inv_nodes                            0-2
node-caps                             no
deg-malig                              3
breast                              left
breast-quad                     left_low
irradiat                              no
Name: 0, dtype: object

Index#: 1
Values:
recurrence_class    no-recurrence-events
age                                40-49
menopause                        premeno
tumor-size                         20-24
inv_nodes                            0-2
node-caps                             no
deg-malig                              2
breast                             right
breast-quad                     right_up
irradiat                              no
Name: 1, dtype: object

Index#: 2
Values:
recurrence_class    no-recurrence-events
age                                40
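`iterrows()` is fine for a dataset of this size, but it builds a Series for every row. On larger frames, `itertuples()` is usually faster. The sketch below runs it on a made-up frame. The column name `deg_malig` is a hypothetical rename: `itertuples()` replaces names that are not valid Python identifiers, such as `deg-malig`, with positional names.

```python
import pandas as pd

# Hypothetical toy data; column names are valid identifiers on purpose.
toy = pd.DataFrame({
    "age": ["30-39", "40-49"],
    "deg_malig": [3, 2],
})

lines = []
# Each row is a namedtuple; the index label is exposed as row.Index.
for row in toy.itertuples(index=True):
    lines.append(f"Index#: {row.Index} age={row.age} deg_malig={row.deg_malig}")

print("\n".join(lines))
```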

### Pivot
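Mean degree of malignancy by tumor size (rows) and age group (columns). `pivot_table` averages the values by default, and an empty cell (NaN) means no cases fall in that combination.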

In [5]:
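# average deg-malig for each tumor-size (rows) and age (columns) combination; the default aggfunc is the mean
df.pivot_table(index='tumor-size', columns='age', values='deg-malig')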

| tumor-size (rows) / age (columns) | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70-79 |
|---|---|---|---|---|---|---|
| 0-4 | NaN | 2.0 | 2.5 | 1.333333 | NaN | 1.0 |
| 10-14 | NaN | 1.5 | 1.625 | 1.555556 | 1.5 | 2.0 |
| 15-19 | NaN | 1.8 | 2.6 | 1.7 | 1.888889 | 1.0 |
| 20-24 | NaN | 2.333333 | 2.047619 | 2.0 | 1.875 | 3.0 |
| 25-29 | NaN | 2.166667 | 2.111111 | 2.047619 | 2.222222 | NaN |
| 30-34 | NaN | 2.142857 | 2.2 | 2.35 | 2.076923 | NaN |
| 35-39 | 2.0 | 3.0 | 2.0 | 2.571429 | 3.0 | NaN |
| 40-44 | NaN | 2.0 | 1.8 | 2.625 | 2.333333 | 1.0 |
| 45-49 | NaN | NaN | 2.0 | NaN | 2.0 | NaN |
| 5-9 | NaN | 2.0 | 1.0 | 2.0 | 1.0 | NaN |
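The tumor-size rows are sorted as text, which is why `5-9` lands after `45-49`. One way to put the bands in numeric order is to reindex by each band's lower bound. The sketch below does this on a made-up frame (the `toy` values are illustrative only) and also spells out the default `aggfunc`:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the assignment.
toy = pd.DataFrame({
    "tumor-size": ["5-9", "10-14", "10-14", "0-4"],
    "age": ["30-39", "30-39", "40-49", "40-49"],
    "deg-malig": [2, 1, 3, 2],
})

# aggfunc="mean" is the default; it is written out here for clarity.
table = toy.pivot_table(
    index="tumor-size", columns="age", values="deg-malig", aggfunc="mean"
)

# Order the bands by their numeric lower bound instead of alphabetically.
ordered = sorted(table.index, key=lambda band: int(band.split("-")[0]))
table = table.reindex(ordered)
print(table)
```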


### Melt
Unpivot the menopause and breast columns into long format, keeping age as the identifier column for each value.

In [6]:
# stack menopause and breast into 'variable'/'value' pairs, repeating age for each pair
pd.melt(df, id_vars='age', value_vars=['menopause', 'breast'])

|   | age | variable | value |
|---|---|---|---|
| 0 | 30-39 | menopause | premeno |
| 1 | 40-49 | menopause | premeno |
| 2 | 40-49 | menopause | premeno |
| 3 | 60-69 | menopause | ge40 |
| 4 | 40-49 | menopause | premeno |
| ... | ... | ... | ... |
| 567 | 30-39 | breast | left |
| 568 | 30-39 | breast | left |
| 569 | 60-69 | breast | right |
| 570 | 40-49 | breast | left |
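Melting produces one output row per original row for each column listed in `value_vars`, so the result has twice as many rows as the input here. The sketch below shows the reshaping on a made-up two-row frame (the `toy` values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data using three of the assignment's column names.
toy = pd.DataFrame({
    "age": ["30-39", "60-69"],
    "menopause": ["premeno", "ge40"],
    "breast": ["left", "right"],
})

# Each value column is stacked in turn; age is repeated alongside every value.
long = pd.melt(toy, id_vars="age", value_vars=["menopause", "breast"])
print(long)
```

All the `menopause` rows come first, followed by all the `breast` rows, which matches the layout of the output above.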
