### Data cleaning:

tidy data != clean data: tidying is only one of several data cleaning tasks.

- outlier checking
- date parsing
- missing value imputation etc.
- data tidying: structuring datasets to facilitate analysis.
    
### Data semantics: tidy data
- Value: every value belongs to a variable and an observation.
- Variable: a variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
- Observation: an observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

### We'll look at two major kinds of transformations:

- a melt takes the data from wide to long
- a spread, or pivot, takes the data from long to wide
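
A minimal sketch of the two directions as a round trip. The `wide` frame, its column names, and its values are made up for illustration and are not part of the exercises.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Austin", "Dallas"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# Melt: wide -> long. Each (city, year) pair becomes its own row.
long_df = wide.melt(id_vars="city", var_name="year", value_name="sales")

# Pivot: long -> wide. The distinct year values become column labels again.
wide_again = (
    long_df.pivot(index="city", columns="year", values="sales")
    .reset_index()
    .rename_axis(None, axis=1)  # drop the leftover 'year' label on the columns
)
```

The melted frame has one row per city and year, and pivoting it restores the original column layout.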



### A melt will combine multiple columns into two columns. There are 3 key parameters when melting:

- id_vars: Which vars should not be melted. If omitted, all the columns in the data frame will be melted together.
- var_name: The name of the column that will hold the names of the columns that will be combined.
- value_name: The name of the column that will hold the resulting values.
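
A small sketch of how these parameters change the result, including what happens when `id_vars` is left out. The `scores` frame is a made-up example.

```python
import pandas as pd

# Hypothetical frame for illustration only.
scores = pd.DataFrame({
    "student": ["Ann", "Bob"],
    "math": [90, 80],
    "art": [70, 60],
})

# With id_vars: 'student' is kept as-is and the two subject columns are stacked.
named = scores.melt(id_vars=["student"], var_name="subject", value_name="score")

# Without id_vars: every column, including 'student', is melted together,
# and the two output columns get the default names 'variable' and 'value'.
everything = scores.melt()
```

Here `named` has 4 rows (2 students x 2 subjects), while `everything` has 6 rows because the student names are melted as well.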

In [8]:
import numpy as np
import pandas as pd

from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns

### Exercise 

- Attendance Data

Load the attendance.csv file and calculate an attendance percentage for each student. One half day is worth 50% of a full day, and 10 tardies equal one absence.


In [36]:
df = pd.read_csv('/Users/dilip/Desktop/Codeup-data-science/classification-exercise /untidy-data/attendance.csv')
df


Unnamed: 0.1,Unnamed: 0,2018-01-01,2018-01-02,2018-01-03,2018-01-04,2018-01-05,2018-01-06,2018-01-07,2018-01-08
0,Sally,P,T,T,H,P,A,T,T
1,Jane,A,P,T,T,T,T,A,T
2,Billy,A,T,A,A,H,T,P,T
3,John,P,T,H,P,P,T,P,P


In [10]:
# rename columns -- change the unnamed first column to 'name'

In [37]:
df.columns = ['name', '2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08']
df

Unnamed: 0,name,2018-01-01,2018-01-02,2018-01-03,2018-01-04,2018-01-05,2018-01-06,2018-01-07,2018-01-08
0,Sally,P,T,T,H,P,A,T,T
1,Jane,A,P,T,T,T,T,A,T
2,Billy,A,T,A,A,H,T,P,T
3,John,P,T,H,P,P,T,P,P


In [13]:
# melt attendance dataframe 

# treatments.melt(id_vars=['name'], var_name='name_you_want_to_change', value_name='value')

In [38]:
df = df.melt(id_vars=['name'], var_name='Date', value_name='attendance')
df


Unnamed: 0,name,Date,attendance
0,Sally,2018-01-01,P
1,Jane,2018-01-01,A
2,Billy,2018-01-01,A
3,John,2018-01-01,P
4,Sally,2018-01-02,T
5,Jane,2018-01-02,P
6,Billy,2018-01-02,T
7,John,2018-01-02,T
8,Sally,2018-01-03,T
9,Jane,2018-01-03,T


In [None]:
# calculate an attendance percentage for each student.
    # first create a function that scores each status as the question specifies
            # present = 1
            # half day = 50% or 0.5
            # absence = 0
            # tardy = 0.9 (10 tardies = 1 absence, so each tardy costs 0.1)
   

In [126]:
def cal_attendance(status):
    if status == 'H':
        point = 0.5
    elif status == 'A':
        point = 0
    elif status == 'T':
        point = 0.9  # numeric, not the string '0.9'; a string makes the column non-numeric
    else:
        point = 1
    return point
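
An alternative sketch of the same scoring using a lookup dictionary and `Series.map` instead of a function. The name `POINTS` is hypothetical. One difference to keep in mind: the function above scores any unrecognised code as 1, whereas `map` turns it into NaN.

```python
import pandas as pd

# Hypothetical lookup table: the fraction of a day credited per status code.
POINTS = {"P": 1.0, "H": 0.5, "T": 0.9, "A": 0.0}

statuses = pd.Series(["P", "T", "H", "A", "?"])

# Codes missing from the dictionary (here '?') come back as NaN.
points = statuses.map(POINTS)
```

Because every value in the dictionary is a float, the resulting column is numeric and can be averaged directly.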
  

In [18]:
# add new column (point) to the melted attendance dataframe

In [42]:
df['point']= df.attendance.apply(cal_attendance)

In [127]:
df

Unnamed: 0,name,Date,attendance,point
0,Sally,2018-01-01,P,1.0
1,Jane,2018-01-01,A,0.0
2,Billy,2018-01-01,A,0.0
3,John,2018-01-01,P,1.0
4,Sally,2018-01-02,T,0.9
5,Jane,2018-01-02,P,1.0
6,Billy,2018-01-02,T,0.9
7,John,2018-01-02,T,0.9
8,Sally,2018-01-03,T,0.9
9,Jane,2018-01-03,T,0.9


In [133]:
df.name.value_counts()

Sally    8
John     8
Billy    8
Jane     8
Name: name, dtype: int64

In [129]:
df.point.value_counts()

0.9    14
1       9
0       6
0.5     3
Name: point, dtype: int64

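In [132]:
# add total points by each name and divide by 8
# or get the mean value of each name
# note: .mean() raises "DataError: No numeric types to aggregate" when point
# is stored as object dtype, so coerce it to numeric first

df['point'] = pd.to_numeric(df['point'])
df.groupby('name')['point'].mean()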
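
A minimal end-to-end sketch of the attendance exercise. It assumes the four students' rows printed earlier, typed in by hand rather than read from the CSV, and the names `attendance`, `tidy`, and `percent` are illustrative.

```python
import pandas as pd

# The four rows printed earlier, typed in by hand instead of read from the CSV.
dates = [f"2018-01-0{d}" for d in range(1, 9)]
rows = {
    "Sally": list("PTTHPATT"),
    "Jane": list("APTTTTAT"),
    "Billy": list("ATAAHTPT"),
    "John": list("PTHPPTPP"),
}
attendance = pd.DataFrame.from_dict(rows, orient="index", columns=dates)
attendance = attendance.rename_axis("name").reset_index()

# Wide -> long: one row per student per day.
tidy = attendance.melt(id_vars="name", var_name="date", value_name="status")

# Numeric points only: a string such as '0.9' would turn the column into
# object dtype and the mean below would fail.
tidy["point"] = tidy["status"].map({"P": 1.0, "H": 0.5, "T": 0.9, "A": 0.0})

# Mean credit per student, expressed as a percentage.
percent = tidy.groupby("name")["point"].mean().mul(100).round(2)
```

On these rows the result is 76.25 for Sally, 68.75 for Jane, 52.5 for Billy, and 91.25 for John.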

### Coffee Levels

- Read the coffee_levels.csv file.
- Transform the data so that each carafe is in its own column.
- Is this the best shape for the data?

In [79]:
coffee = pd.read_csv('/Users/dilip/Desktop/Codeup-data-science/classification-exercise /untidy-data/coffee_levels.csv')
coffee

Unnamed: 0,hour,coffee_carafe,coffee_amount
0,8,x,0.816164
1,9,x,0.451018
2,10,x,0.843279
3,11,x,0.335533
4,12,x,0.898291
5,13,x,0.310711
6,14,x,0.507288
7,15,x,0.215043
8,16,x,0.183891
9,17,x,0.39156


In [80]:
coffee = coffee.pivot(index = 'hour', columns = 'coffee_carafe', values = 'coffee_amount')
coffee

coffee_carafe,x,y,z
hour,,,
8,0.816164,0.189297,0.999264
9,0.451018,0.521502,0.91599
10,0.843279,0.023163,0.144928
11,0.335533,0.235529,0.311495
12,0.898291,0.017009,0.771947
13,0.310711,0.997464,0.39852
14,0.507288,0.058361,0.864464
15,0.215043,0.144644,0.436364
16,0.183891,0.544676,0.280621
17,0.39156,0.594126,0.436677


In [81]:
coffee = coffee.reset_index()

In [82]:
coffee

coffee_carafe,hour,x,y,z
0,8,0.816164,0.189297,0.999264
1,9,0.451018,0.521502,0.91599
2,10,0.843279,0.023163,0.144928
3,11,0.335533,0.235529,0.311495
4,12,0.898291,0.017009,0.771947
5,13,0.310711,0.997464,0.39852
6,14,0.507288,0.058361,0.864464
7,15,0.215043,0.144644,0.436364
8,16,0.183891,0.544676,0.280621
9,17,0.39156,0.594126,0.436677


In [None]:
# plot

In [85]:
# remove the leftover 'coffee_carafe' label from the column axis
coffee.columns.name = ''
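
The pivot, index reset, and label cleanup can also be chained in one expression. This is a sketch on a few rows of the coffee data typed in by hand. It uses `rename_axis(None, axis=1)`, which sets the column-axis name to `None` rather than to an empty string.

```python
import pandas as pd

# A few rows in the long layout of coffee_levels.csv, typed in by hand.
coffee_long = pd.DataFrame({
    "hour": [8, 9, 8, 9, 8, 9],
    "coffee_carafe": ["x", "x", "y", "y", "z", "z"],
    "coffee_amount": [0.816164, 0.451018, 0.189297, 0.521502, 0.999264, 0.915990],
})

coffee_wide = (
    coffee_long
    .pivot(index="hour", columns="coffee_carafe", values="coffee_amount")
    .reset_index()               # turn 'hour' back into a regular column
    .rename_axis(None, axis=1)   # drop the leftover 'coffee_carafe' label
)
```

The wide frame has one row per hour and one column per carafe, which is convenient for plotting one line per carafe.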

In [84]:
coffee

Unnamed: 0,hour,x,y,z
0,8,0.816164,0.189297,0.999264
1,9,0.451018,0.521502,0.91599
2,10,0.843279,0.023163,0.144928
3,11,0.335533,0.235529,0.311495
4,12,0.898291,0.017009,0.771947
5,13,0.310711,0.997464,0.39852
6,14,0.507288,0.058361,0.864464
7,15,0.215043,0.144644,0.436364
8,16,0.183891,0.544676,0.280621
9,17,0.39156,0.594126,0.436677


### Cake Recipes

- Read the cake_recipes.csv data. This data set contains cake tastiness scores for combinations of different recipes, oven rack positions, and oven temperatures.
- Tidy the data as necessary.



In [92]:
recipes = pd.read_csv('/Users/dilip/Desktop/Codeup-data-science/classification-exercise /untidy-data/cake_recipes.csv')
recipes

Unnamed: 0,recipe:position,225,250,275,300
0,a:bottom,61.738655,53.912627,74.41473,98.786784
1,a:top,51.709751,52.009735,68.576858,50.22847
2,b:bottom,57.09532,61.904369,61.19698,99.248541
3,b:top,82.455004,95.224151,98.594881,58.169349
4,c:bottom,96.470207,52.001358,92.893227,65.473084
5,c:top,71.306308,82.795477,92.098049,53.960273
6,d:bottom,52.799753,58.670419,51.747686,56.18311
7,d:top,96.873178,76.101363,59.57162,50.971626


In [107]:
# split recipe:position using str.split

splitted_df = recipes['recipe:position'].str.split(':', expand = True)
splitted_df.columns = ['recipe', 'position']
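
A shorter variant of the split, sketched on two rows typed in by hand: the result of `str.split(..., expand=True)` is assigned straight to two new columns, so no separate frame or `pd.concat` is needed.

```python
import pandas as pd

# Two rows in the shape of cake_recipes.csv, typed in by hand.
recipes = pd.DataFrame({
    "recipe:position": ["a:bottom", "a:top"],
    "225": [61.738655, 51.709751],
})

# n=1 splits on the first ':' only, so exactly two columns come back.
recipes[["recipe", "position"]] = recipes["recipe:position"].str.split(
    ":", n=1, expand=True
)

# The combined column is no longer needed once it has been split.
recipes = recipes.drop(columns="recipe:position")
```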



In [105]:
splitted_df

Unnamed: 0,recipe,position
0,a,bottom
1,a,top
2,b,bottom
3,b,top
4,c,bottom
5,c,top
6,d,bottom
7,d,top


In [117]:
# concat recipes & splitted_df
tidy_df = pd.concat([recipes, splitted_df], axis=1).drop(columns='recipe:position')
tidy_df

Unnamed: 0,225,250,275,300,recipe,position
0,61.738655,53.912627,74.41473,98.786784,a,bottom
1,51.709751,52.009735,68.576858,50.22847,a,top
2,57.09532,61.904369,61.19698,99.248541,b,bottom
3,82.455004,95.224151,98.594881,58.169349,b,top
4,96.470207,52.001358,92.893227,65.473084,c,bottom
5,71.306308,82.795477,92.098049,53.960273,c,top
6,52.799753,58.670419,51.747686,56.18311,d,bottom
7,96.873178,76.101363,59.57162,50.971626,d,top


- Which recipe, on average, is the best? recipe b


In [None]:
# find the mean score for each recipe

In [110]:
tidy_df.groupby('recipe').mean(numeric_only=True).mean(axis=1)

recipe
a    63.922201
b    76.736074
c    75.874748
d    62.864844
dtype: float64

- Which oven temperature, on average, produces the best results? 275


In [None]:
# melt the temperature columns into a single column

In [118]:
tidy_df = tidy_df.melt(id_vars=['recipe','position'], var_name='temperature', value_name='result')
tidy_df.head()

Unnamed: 0,recipe,position,temperature,result
0,a,bottom,225,61.738655
1,a,top,225,51.709751
2,b,bottom,225,57.09532
3,b,top,225,82.455004
4,c,bottom,225,96.470207


In [119]:
tidy_df.groupby('temperature').mean(numeric_only=True).mean(axis=1)

temperature
225    71.306022
250    66.577437
275    74.886754
300    66.627655
dtype: float64
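
Once the data is in long form, both averages can come from the same pattern: group by one variable and take the mean of the single `result` column. A sketch using the eight rows printed earlier, typed in by hand. The names `long_df`, `by_recipe`, and `by_temperature` are illustrative.

```python
import pandas as pd

# The eight rows printed earlier, typed in by hand instead of read from the CSV.
recipes = pd.DataFrame({
    "recipe:position": ["a:bottom", "a:top", "b:bottom", "b:top",
                        "c:bottom", "c:top", "d:bottom", "d:top"],
    "225": [61.738655, 51.709751, 57.095320, 82.455004,
            96.470207, 71.306308, 52.799753, 96.873178],
    "250": [53.912627, 52.009735, 61.904369, 95.224151,
            52.001358, 82.795477, 58.670419, 76.101363],
    "275": [74.414730, 68.576858, 61.196980, 98.594881,
            92.893227, 92.098049, 51.747686, 59.571620],
    "300": [98.786784, 50.228470, 99.248541, 58.169349,
            65.473084, 53.960273, 56.183110, 50.971626],
})

# Wide -> long: one row per recipe, position, and temperature.
long_df = recipes.melt(
    id_vars="recipe:position", var_name="temperature", value_name="result"
)
long_df[["recipe", "position"]] = long_df["recipe:position"].str.split(
    ":", expand=True
)

# Same groupby pattern for both questions.
by_recipe = long_df.groupby("recipe")["result"].mean()
by_temperature = long_df.groupby("temperature")["result"].mean()
```

`by_recipe.idxmax()` and `by_temperature.idxmax()` then return the best recipe and temperature directly.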

- Which combination of recipe, rack position, and temperature gives the best result? recipe b, bottom rack, 300 degrees

In [120]:
tidy_df

Unnamed: 0,recipe,position,temperature,result
0,a,bottom,225,61.738655
1,a,top,225,51.709751
2,b,bottom,225,57.09532
3,b,top,225,82.455004
4,c,bottom,225,96.470207
5,c,top,225,71.306308
6,d,bottom,225,52.799753
7,d,top,225,96.873178
8,a,bottom,250,53.912627
9,a,top,250,52.009735


In [122]:
tidy_df.result.max()

99.24854054

In [124]:
tidy_df[tidy_df.result == tidy_df.result.max()]

Unnamed: 0,recipe,position,temperature,result
26,b,bottom,300,99.248541
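
Two alternative ways to pull out the best row, sketched on a few rows of the melted cake data typed in by hand: `idxmax` returns the index label of the largest result, and `nlargest` returns the top rows already sorted.

```python
import pandas as pd

# A few rows of the melted cake data, typed in by hand.
tidy_df = pd.DataFrame({
    "recipe": ["a", "b", "b", "d"],
    "position": ["bottom", "bottom", "top", "top"],
    "temperature": ["300", "300", "275", "225"],
    "result": [98.786784, 99.248541, 98.594881, 96.873178],
})

# Single best row, returned as a Series.
best = tidy_df.loc[tidy_df["result"].idxmax()]

# Top two rows, sorted from highest to lowest result.
top_two = tidy_df.nlargest(2, "result")
```

`nlargest` is handy when the runner-up combinations are also of interest.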
