# Ex - GroupBy

### Introduction:

A GroupBy operation can be summarized as split-apply-combine: split the rows into groups, apply a function to each group, and combine the results.

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [diagram](http://i.imgur.com/yjNkiwL.png).

### Step 1. Import the necessary libraries

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns 


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

In [2]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv"
df = pd.read_csv(url,sep = ',')
df.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


### Step 3. Assign it to a variable called drinks.

In [3]:
drinks = df.copy()
drinks.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


### Step 4. Which continent drinks more beer on average?

In [4]:
# total beer servings over every row of the dataset
total = drinks['beer_servings'].sum()
print("Beer_servings_Total",total)
# per-continent totals (sums, not averages)
drinks.groupby(['continent'])['beer_servings'].sum()

Beer_servings_Total 20489


continent
AF    3258
AS    1630
EU    8720
OC    1435
SA    2101
Name: beer_servings, dtype: int64

In [5]:
drinks.groupby(['continent'])['beer_servings'].sum() / total *100

continent
AF    15.901215
AS     7.955488
EU    42.559422
OC     7.003758
SA    10.254283
Name: beer_servings, dtype: float64

In [6]:
np.sum(drinks.groupby(['continent'])['beer_servings'].sum() / total *100)

83.67416662599443
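
The shares above add up to about 83.7 rather than 100. A likely cause (an assumption, not verified against the file here) is that the CSV codes North America as `NA`, which `read_csv` treats as a missing value by default, and `groupby` then drops rows whose key is missing. The sketch below reproduces the effect on a small made-up CSV and shows `keep_default_na=False` as one possible way to keep the literal string.

```python
import io
import pandas as pd

# Made-up three-row CSV; "NA" stands in for a continent code.
csv_text = "country,beer_servings,continent\nA,25,EU\nB,25,NA\nC,50,AF\n"

default = pd.read_csv(io.StringIO(csv_text))                      # "NA" becomes NaN
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)  # "NA" stays a string

def shares(frame):
    """Percentage of all beer servings contributed by each continent."""
    total = frame["beer_servings"].sum()
    return frame.groupby("continent")["beer_servings"].sum() / total * 100

print(shares(default).sum())  # below 100: the row with a missing continent has no group
print(shares(kept).sum())     # the "NA" row now forms its own group
```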

In [7]:
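def avg_beer_continent(x):
    # x holds the rows of a single continent
    total = x['beer_servings'].sum()
    # despite the name, avg_beer is each country's share of its continent's total, not a mean
    x['avg_beer'] = x['beer_servings']/total
    return x

drinks_avg = drinks.groupby('continent').apply(avg_beer_continent)
drinks_avg.head()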

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent,avg_beer
0,Afghanistan,0.0,0.0,0.0,0.0,AS,0.0
1,Albania,89.0,132.0,54.0,4.9,EU,0.010206
2,Algeria,25.0,0.0,14.0,0.7,AF,0.007673
3,Andorra,245.0,138.0,312.0,12.4,EU,0.028096
4,Angola,217.0,57.0,45.0,5.9,AF,0.066605


In [8]:
drinks_avg.groupby('continent')['avg_beer'].sum()

continent
AF    1.0
AS    1.0
EU    1.0
OC    1.0
SA    1.0
Name: avg_beer, dtype: float64
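
The per-continent share computed with `apply` above can also be written with `transform`, which broadcasts each group's total back onto its rows without modifying the group inside a function. This is a sketch on four rows copied from the `head()` output; `beer_share` and `continent_total` are names introduced here for illustration.

```python
import pandas as pd

# Four rows copied from the head() output; the full dataset has many more countries.
sample = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [89, 25, 245, 217],
    "continent": ["EU", "AF", "EU", "AF"],
})

# transform("sum") returns one value per row: the total of that row's continent.
continent_total = sample.groupby("continent")["beer_servings"].transform("sum")
sample["beer_share"] = sample["beer_servings"] / continent_total

# Sanity check: the shares within each continent should add up to 1.
check = sample.groupby("continent")["beer_share"].sum()
print(check)
```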

In [9]:
drinks.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
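
Step 4 asks for an average, whereas the cells above work with totals and shares. A minimal sketch of the per-continent mean follows, using only the five rows shown by `head()`, so the numbers it prints are not the full-dataset answer. On the full data, the Step 6 output in this notebook lists EU with the highest mean `beer_servings` (193.777778) among the continents shown.

```python
import pandas as pd

# Rows copied from the head() output above; a sample, not the whole dataset.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [0, 89, 25, 245, 217],
    "continent": ["AS", "EU", "AF", "EU", "AF"],
})

# Mean beer servings per continent, largest first.
mean_beer = (
    sample.groupby("continent")["beer_servings"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_beer)
print("Highest average in this sample:", mean_beer.idxmax())
```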


### Step 5. For each continent print the statistics for wine consumption.

In [10]:
drinks.groupby('continent').describe()

,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,spirit_servings,spirit_servings,...,wine_servings,wine_servings,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol
continent,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0,53.0,16.339623,...,13.0,233.0,53.0,3.007547,2.647557,0.0,0.7,2.3,4.7,9.1
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0,44.0,60.840909,...,8.0,123.0,44.0,2.170455,2.770239,0.0,0.1,1.2,2.425,11.5
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0,45.0,132.555556,...,195.0,370.0,45.0,8.617778,3.358455,0.0,6.6,10.0,10.9,14.4
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0,16.0,58.4375,...,23.25,212.0,16.0,3.38125,3.345688,0.0,1.0,1.75,6.15,10.4
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0,12.0,114.75,...,98.5,221.0,12.0,6.308333,1.531166,3.8,5.25,6.85,7.375,8.3
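
The cell above describes every numeric column, and the wine columns are mostly hidden by the `...` truncation. Since the step asks only about wine, one option is to select `wine_servings` before calling `describe()`. The sketch below uses the five rows from `head()` as sample data.

```python
import pandas as pd

# Rows copied from the head() output; a sample, not the whole dataset.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "wine_servings": [0, 54, 14, 312, 45],
    "continent": ["AS", "EU", "AF", "EU", "AF"],
})

# Selecting the column first limits describe() to wine consumption only.
wine_stats = sample.groupby("continent")["wine_servings"].describe()
print(wine_stats)
```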


### Step 6. Print the mean alcohol consumption per continent for every column

In [11]:
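# numeric_only=True skips the text column 'country', which recent pandas versions refuse to average
drinks.groupby('continent').mean(numeric_only=True)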

continent,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
AF,61.471698,16.339623,16.264151,3.007547
AS,37.045455,60.840909,9.068182,2.170455
EU,193.777778,132.555556,142.222222,8.617778
OC,89.6875,58.4375,35.625,3.38125
SA,175.083333,114.75,62.416667,6.308333


### Step 7. Print the median alcohol consumption per continent for every column
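No solution cell is given for this step. A possible approach mirrors the mean in Step 6, swapping in `median`; the sketch uses the five rows from `head()` as sample data.

```python
import pandas as pd

# Rows copied from the head() output; a sample, not the whole dataset.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [0, 89, 25, 245, 217],
    "spirit_servings": [0, 132, 0, 138, 57],
    "wine_servings": [0, 54, 14, 312, 45],
    "total_litres_of_pure_alcohol": [0.0, 4.9, 0.7, 12.4, 5.9],
    "continent": ["AS", "EU", "AF", "EU", "AF"],
})

# numeric_only=True leaves the text column 'country' out of the result.
medians = sample.groupby("continent").median(numeric_only=True)
print(medians)
```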

### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame
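No solution cell is given here either. One way to get several statistics at once as a DataFrame is `agg` with a list of function names; the sketch uses the five rows from `head()` as sample data.

```python
import pandas as pd

# Rows copied from the head() output; a sample, not the whole dataset.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "spirit_servings": [0, 132, 0, 138, 57],
    "continent": ["AS", "EU", "AF", "EU", "AF"],
})

# A list of function names yields one column per statistic.
spirit_stats = sample.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
print(spirit_stats)
```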

# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users.

In [None]:
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')


### Step 4. Find the mean age per occupation
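A possible approach, sketched on made-up rows rather than the real file. The column names `age`, `gender` and `occupation` are assumptions about the layout of `u.user`; adjust them if the file differs.

```python
import pandas as pd

# Hypothetical sample standing in for the users table (values are invented).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age": [24, 53, 23, 27, 33, 42],
    "gender": ["M", "F", "M", "F", "F", "M"],
    "occupation": ["technician", "other", "writer", "technician", "other", "executive"],
}).set_index("user_id")

# Mean age within each occupation.
mean_age = users.groupby("occupation")["age"].mean()
print(mean_age)
```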

### Step 5. Discover the Male ratio per occupation and sort it from the most to the least
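One way to sketch this: the mean of a boolean "is male" flag within each occupation is the fraction of men. The rows are made up, and the column names and the `"M"` code are assumptions about `u.user`.

```python
import pandas as pd

# Hypothetical sample standing in for the users table (values are invented).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "gender": ["M", "F", "M", "F", "F", "M"],
    "occupation": ["technician", "other", "writer", "technician", "other", "executive"],
}).set_index("user_id")

# True/False flag per user; averaging it per occupation gives the male fraction.
is_male = users["gender"] == "M"
male_ratio = (
    is_male.groupby(users["occupation"])
    .mean()
    .mul(100)                      # express as a percentage
    .sort_values(ascending=False)  # most male-dominated first
)
print(male_ratio)
```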

### Step 6. For each occupation, calculate the minimum and maximum ages
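A sketch using `agg` with two function names, again on made-up rows with assumed column names.

```python
import pandas as pd

# Hypothetical sample standing in for the users table (values are invented).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age": [24, 53, 23, 27, 33, 42],
    "occupation": ["technician", "other", "writer", "technician", "other", "executive"],
}).set_index("user_id")

# One row per occupation, one column per statistic.
age_range = users.groupby("occupation")["age"].agg(["min", "max"])
print(age_range)
```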

### Step 7. For each combination of occupation and gender, calculate the mean age
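Grouping by a list of columns produces one group per combination. The sketch below runs on made-up rows with assumed column names.

```python
import pandas as pd

# Hypothetical sample standing in for the users table (values are invented).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age": [24, 53, 23, 27, 33, 42],
    "gender": ["M", "F", "M", "F", "F", "M"],
    "occupation": ["technician", "other", "writer", "technician", "other", "executive"],
}).set_index("user_id")

# The result is a Series indexed by (occupation, gender).
mean_age_by_group = users.groupby(["occupation", "gender"])["age"].mean()
print(mean_age_by_group)
```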

### Step 8. For each occupation, present the percentage of women and men

In [None]:
# create a data frame and apply count to gender

# create a DataFrame and apply count for each occupation

# divide gender_ocup by occup_count and multiply by 100

# present all rows from the 'gender column'
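
The outline in the comments above can be sketched as follows. The variable names `gender_ocup` and `occup_count` come from the comments; the rows are made up and the column names are assumptions about `u.user`.

```python
import pandas as pd

# Hypothetical sample standing in for the users table (values are invented).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "gender": ["M", "F", "M", "F", "F", "M"],
    "occupation": ["technician", "other", "writer", "technician", "other", "executive"],
}).set_index("user_id")

# Number of users per (occupation, gender) pair.
gender_ocup = users.groupby(["occupation", "gender"]).size()

# Number of users per occupation.
occup_count = users.groupby("occupation").size()

# Divide along the 'occupation' level so each pair is scaled by its occupation total.
occup_gender = gender_ocup.div(occup_count, level="occupation") * 100
print(occup_gender)
```

Pairs that never occur (for example an occupation with no men in the sample) simply do not appear in the result.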


# Regiment

### Introduction:

Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.

### Step 1. Import the necessary libraries

### Step 2. Create the DataFrame with the following values:

In [None]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

### Step 3. Assign it to a variable called regiment.
#### Don't forget to name each column

In [None]:
regiment = pd.DataFrame(raw_data, columns = raw_data.keys())


### Step 4. What is the mean preTestScore from the regiment Nighthawks?  
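A possible approach is to filter the rows with a boolean mask and average the column. The sketch rebuilds only the two columns it needs from `raw_data`.

```python
import pandas as pd

# Two columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4,
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Keep only the Nighthawks rows, then average their pre-test scores.
is_nighthawks = regiment["regiment"] == "Nighthawks"
nighthawks_mean = regiment.loc[is_nighthawks, "preTestScore"].mean()
print(nighthawks_mean)
```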

### Step 5. Present general statistics by company
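One reading of "general statistics" is `describe()` applied per group; the sketch below assumes that reading and uses the company and score columns from `raw_data`.

```python
import pandas as pd

# Columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "company": ["1st", "1st", "2nd", "2nd"] * 3,
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

# describe() returns count, mean, std, min, quartiles and max for each numeric column.
company_stats = regiment.groupby("company").describe()
print(company_stats)
```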

### Step 6. What is the mean of each company's preTestScore?
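A short sketch: group by `company`, select the score column, and take the mean.

```python
import pandas as pd

# Two columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "company": ["1st", "1st", "2nd", "2nd"] * 3,
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

company_pre_mean = regiment.groupby("company")["preTestScore"].mean()
print(company_pre_mean)
```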

### Step 7. Present the mean preTestScores grouped by regiment and company
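Grouping by both keys gives a Series with a two-level index, sketched here on the relevant columns of `raw_data`.

```python
import pandas as pd

# Three columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4,
    "company": ["1st", "1st", "2nd", "2nd"] * 3,
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Index levels are (regiment, company).
pre_mean = regiment.groupby(["regiment", "company"])["preTestScore"].mean()
print(pre_mean)
```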

### Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing
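One way to avoid the two-level index is `unstack()`, which pivots the inner level (`company`) into columns. This is a sketch of that option; `reset_index()` would be another.

```python
import pandas as pd

# Three columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4,
    "company": ["1st", "1st", "2nd", "2nd"] * 3,
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Rows are regiments, columns are companies.
pre_mean_wide = regiment.groupby(["regiment", "company"])["preTestScore"].mean().unstack()
print(pre_mean_wide)
```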

### Step 9. Group the entire dataframe by regiment and company
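The step does not say which aggregate to show; the sketch below assumes the mean of every numeric column. `numeric_only=True` leaves out the text column `name`.

```python
import pandas as pd

# raw_data as defined in Step 2.
raw_data = {
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4,
    "company": ["1st", "1st", "2nd", "2nd"] * 3,
    "name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze", "Jacon",
             "Ryaner", "Sone", "Sloan", "Piger", "Riani", "Ali"],
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
}
regiment = pd.DataFrame(raw_data)

# Mean of each score column for every (regiment, company) pair.
grouped_mean = regiment.groupby(["regiment", "company"]).mean(numeric_only=True)
print(grouped_mean)
```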

### Step 10. What is the number of observations in each regiment and company?
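`size()` counts the rows in each group, sketched here on the two key columns of `raw_data`.

```python
import pandas as pd

# Two columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4,
    "company": ["1st", "1st", "2nd", "2nd"] * 3,
})

# Number of rows for every (regiment, company) pair.
counts = regiment.groupby(["regiment", "company"]).size()
print(counts)
```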

### Step 11. Iterate over a group and print the name and the whole data from the regiment

In [None]:
# Group the dataframe by regiment, and for each regiment,


    # print the name of the regiment

    # print the data of that regiment
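
The outline in the comments above can be sketched as a loop over the GroupBy object, which yields a `(name, group)` pair for each regiment. Only three columns of `raw_data` are rebuilt here to keep the example short.

```python
import pandas as pd

# Three columns copied from raw_data in Step 2.
regiment = pd.DataFrame({
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4,
    "name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze", "Jacon",
             "Ryaner", "Sone", "Sloan", "Piger", "Riani", "Ali"],
    "preTestScore": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Group the dataframe by regiment, and for each regiment,
for name, group in regiment.groupby("regiment"):
    # print the name of the regiment
    print(name)
    # print the data of that regiment
    print(group)
```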
