# Data Manipulation with Pandas

## Introducing DataFrames

- pandas is built on top of NumPy and Matplotlib.
- It is designed to work with rectangular (tabular) data, stored as a DataFrame.
- Every value in a column has the same data type.
- Different columns can have different data types.
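
As a small illustration of the last two points, the sketch below rebuilds three rows of the homelessness data used later in these notes (typed in by hand here rather than read from a file) and inspects the data type of each column. The name homelessness_sample is made up for this example.

```python
import pandas as pd

# First three rows of the homelessness data, typed in by hand for illustration
homelessness_sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "individuals": [2570.0, 1434.0, 7259.0],
        "state_pop": [4887681, 735139, 7158024],
    }
)

# One data type per column, but the types differ between columns
column_types = homelessness_sample.dtypes
print(column_types)
```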

- object.head()
    Returns the first rows of the DataFrame (five by default)
- object.shape (an attribute, so it takes no parentheses)
    Returns the number of rows and columns as a tuple
- object.describe()
    Returns summary statistics for the numeric columns
- object.columns
    Contains the column names
- object.index
    Contains the row numbers or row names

### Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- .head() returns the first few rows (the “head” of the DataFrame).
- .info() shows information on each of the columns, such as the data type and number of missing values.
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals who are not part of a family with children. The family_members column is the number of homeless individuals who are part of a family with children. The state_pop column is the state's total population.

#### Instructions
pandas is imported for you.

Print the head of the homelessness DataFrame.

Print information about the column types and missing values in homelessness.

Print the number of rows and columns in homelessness.

Print some summary statistics that describe the homelessness DataFrame.

In [39]:
import pandas as pd
homelessness = pd.read_csv("homelessness.csv")

In [40]:
# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
print(homelessness.info())

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())

               region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
None
(51, 5)
         individuals  family_members     state_pop
count      51

### Parts of a DataFrame

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- .values: A two-dimensional NumPy array of values.
- .columns: An index of columns: the column names.
- .index: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)

homelessness is available.

#### Instructions

Import pandas using the alias pd.
Print a 2D NumPy array of the values in homelessness.
Print the column names of homelessness.
Print the index of homelessness.

In [41]:
# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)

# Print the column index of homelessness
print(homelessness.columns)

# Print the row index of homelessness
print(homelessness.index)

[['East South Central' 'Alabama' 2570.0 864.0 4887681]
 ['Pacific' 'Alaska' 1434.0 582.0 735139]
 ['Mountain' 'Arizona' 7259.0 2606.0 7158024]
 ['West South Central' 'Arkansas' 2280.0 432.0 3009733]
 ['Pacific' 'California' 109008.0 20964.0 39461588]
 ['Mountain' 'Colorado' 7607.0 3250.0 5691287]
 ['New England' 'Connecticut' 2280.0 1696.0 3571520]
 ['South Atlantic' 'Delaware' 708.0 374.0 965479]
 ['South Atlantic' 'District of Columbia' 3770.0 3134.0 701547]
 ['South Atlantic' 'Florida' 21443.0 9587.0 21244317]
 ['South Atlantic' 'Georgia' 6943.0 2556.0 10511131]
 ['Pacific' 'Hawaii' 4131.0 2399.0 1420593]
 ['Mountain' 'Idaho' 1297.0 715.0 1750536]
 ['East North Central' 'Illinois' 6752.0 3891.0 12723071]
 ['East North Central' 'Indiana' 3776.0 1482.0 6695497]
 ['West North Central' 'Iowa' 1711.0 1038.0 3148618]
 ['West North Central' 'Kansas' 1443.0 773.0 2911359]
 ['East South Central' 'Kentucky' 2735.0 953.0 4461153]
 ['West South Central' 'Louisiana' 2540.0 519.0 4659690]
 ['New 

## Sorting and Subsetting

### Sorting
- Sort ascending
    object.sort_values(field)
- Sort descending
    object.sort_values(field, ascending=False)
- Sort on more than one field, ascending on both
    object.sort_values([field, field2])
- Sort on more than one field, ascending on the first and descending on the second
    object.sort_values([field, field2], ascending=[True, False])
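
A minimal sketch of the four sorting patterns above, using a handful of rows copied from the homelessness data shown elsewhere in these notes. The name homelessness_sample is made up for this example.

```python
import pandas as pd

homelessness_sample = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "state": ["Alaska", "Arizona", "California", "Colorado"],
        "family_members": [582.0, 2606.0, 20964.0, 3250.0],
    }
)

# Ascending on one column
by_fam = homelessness_sample.sort_values("family_members")

# Descending on one column
by_fam_desc = homelessness_sample.sort_values("family_members", ascending=False)

# Ascending on both columns: ties in region are broken by family_members
by_region_fam = homelessness_sample.sort_values(["region", "family_members"])

# Ascending on region, descending on family_members within each region
by_region_fam_desc = homelessness_sample.sort_values(
    ["region", "family_members"], ascending=[True, False]
)

print(by_region_fam_desc)
```

Note that multiple columns must be passed as a single list; the ascending argument then takes a list of the same length.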

### Subsetting Columns

- object["column"]
    Returns all the values in the column
- object[["column1", "column2"]]
    Returns all the values from both columns

### Subsetting Rows

- object["field"] > targetNumber
    Returns a boolean Series with True or False for each row

- object[object["field"] > targetNumber]
    Returns the rows that meet the condition

- object[object["field"] == "string"]
    Returns the rows where field is equal to string
    Also works with dates in "yyyy-mm-dd" format

- object[(object["field"] == "value1") & (object["field2"] == "value2")]
    Returns the rows that meet both conditions

- is_value1_or_value2 = object["field"].isin(["value1", "value2"])
    Returns a boolean Series indicating whether each row matches either value
    object[is_value1_or_value2]
    Returns the rows that match either value

- is_value123or4 = object[object["field"].isin(["value1", "value2", "value3", "value4"])]
    Returns the DataFrame with the rows that match any of the values
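
The row-subsetting patterns above can be sketched on a few rows copied from the homelessness data. The name homelessness_sample and the threshold of 2000 are chosen only for this illustration.

```python
import pandas as pd

homelessness_sample = pd.DataFrame(
    {
        "region": ["East South Central", "Pacific", "Mountain", "Pacific"],
        "state": ["Alabama", "Alaska", "Arizona", "California"],
        "individuals": [2570.0, 1434.0, 7259.0, 109008.0],
    }
)

# A comparison returns a boolean Series, one value per row
is_large = homelessness_sample["individuals"] > 2000

# Passing the boolean Series in square brackets keeps the True rows
large = homelessness_sample[is_large]

# Two conditions combined with &; each one is wrapped in parentheses
large_pacific = homelessness_sample[
    (homelessness_sample["individuals"] > 2000)
    & (homelessness_sample["region"] == "Pacific")
]

# .isin() matches any value in the list
alaska_or_arizona = homelessness_sample[
    homelessness_sample["state"].isin(["Alaska", "Arizona"])
]

print(is_large.tolist())
print(large_pacific)
```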
### Instructions

#### Sorting rows
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as homelessness_ind.

Print the head of the sorted DataFrame.

Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.

Print the head of the sorted DataFrame.

Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.

Print the head of the sorted DataFrame.

In [42]:
# Sort homelessness by individual
homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows
print(homelessness_ind.head())

# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending = False)

# Print the top few rows
print(homelessness_fam.head())

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending = [True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

                region         state  individuals  family_members  state_pop
50            Mountain       Wyoming        434.0           205.0     577601
34  West North Central  North Dakota        467.0            75.0     758080
7       South Atlantic      Delaware        708.0           374.0     965479
39         New England  Rhode Island        747.0           354.0    1058287
45         New England       Vermont        780.0           511.0     624358
                region          state  individuals  family_members  state_pop
32        Mid-Atlantic       New York      39827.0         52070.0   19530351
4              Pacific     California     109008.0         20964.0   39461588
21         New England  Massachusetts       6811.0         13257.0    6882635
9       South Atlantic        Florida      21443.0          9587.0   21244317
43  West South Central          Texas      19199.0          6111.0   28628666
                region      state  individuals  family_members  state_

#### Subsetting columns
When working with data, you may not need all of the variables in your dataset. 

Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. 

To select only "col_a" of the DataFrame df, use df["col_a"]

To select "col_a" and "col_b" of df, use df[["col_a", "col_b"]]
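
One detail worth checking for yourself: single square brackets return a pandas Series, while double square brackets return a DataFrame, even when the list holds only one column name. The sketch below uses a hand-typed DataFrame called df with made-up values.

```python
import pandas as pd

# Made-up values, only to show the return types
df = pd.DataFrame(
    {"col_a": [1, 2, 3], "col_b": ["x", "y", "z"], "col_c": [0.1, 0.2, 0.3]}
)

one_col_series = df["col_a"]        # single brackets: a Series
one_col_frame = df[["col_a"]]       # double brackets: a one-column DataFrame
two_cols = df[["col_b", "col_a"]]   # columns come back in the order you list them

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(list(two_cols.columns))
```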

homelessness is available and pandas is loaded as pd.

Create a DataFrame called individuals that contains only the individuals column of homelessness.

Print the head of the result.

Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.

Print the head of the result.

Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.

Print the head of the result.

In [43]:
# Select the individuals column
individuals = homelessness["individuals"]

# Print the head of the result
print(individuals.head())

# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]

# Print the head of the result
print(state_fam.head())

# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]

# Print the head of the result
print(ind_state.head())

0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64
        state  family_members
0     Alabama           864.0
1      Alaska           582.0
2     Arizona          2606.0
3    Arkansas           432.0
4  California         20964.0
   individuals       state
0       2570.0     Alabama
1       1434.0      Alaska
2       7259.0     Arizona
3       2280.0    Arkansas
4     109008.0  California


#### Subsetting rows
A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return True or False for each row, then pass the result inside square brackets.

    dogs[dogs["height_cm"] > 60]
    dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &. Each condition needs its own parentheses, because & binds more tightly than the comparison operators.

    dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

homelessness is available and pandas is loaded as pd.

Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.

Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.

Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.

In [44]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

# See the result
print(ind_gt_10k)

# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# See the result
print(mountain_reg)

# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# See the result
print(fam_lt_1k_pac)

                region       state  individuals  family_members  state_pop
4              Pacific  California     109008.0         20964.0   39461588
9       South Atlantic     Florida      21443.0          9587.0   21244317
32        Mid-Atlantic    New York      39827.0         52070.0   19530351
37             Pacific      Oregon      11139.0          3337.0    4181886
43  West South Central       Texas      19199.0          6111.0   28628666
47             Pacific  Washington      16424.0          5880.0    7523869
      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
5   Mountain    Colorado       7607.0          3250.0    5691287
12  Mountain       Idaho       1297.0           715.0    1750536
26  Mountain     Montana        983.0           422.0    1060665
28  Mountain      Nevada       7058.0           486.0    3027341
31  Mountain  New Mexico       1949.0           602.0    2092741
44  Mountain        

#### Subsetting rows by categorical variables
Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. 

This can get tedious when you want all states in one of three different regions, for example. 

Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

    colors = ["brown", "black", "tan"]
    condition = dogs["color"].isin(colors)
    dogs[condition]

homelessness is available and pandas is loaded as pd.

Filter homelessness for cases where the USA census region is "South Atlantic" or it is "Mid-Atlantic", assigning to south_mid_atlantic. View the printed result.

Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, assigning to mojave_homelessness. View the printed result.

In [45]:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[(homelessness["region"] == "South Atlantic") | (homelessness["region"] == "Mid-Atlantic")]

# See the result
print(south_mid_atlantic)

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)

            region                 state  individuals  family_members  \
7   South Atlantic              Delaware        708.0           374.0   
8   South Atlantic  District of Columbia       3770.0          3134.0   
9   South Atlantic               Florida      21443.0          9587.0   
10  South Atlantic               Georgia       6943.0          2556.0   
20  South Atlantic              Maryland       4914.0          2230.0   
30    Mid-Atlantic            New Jersey       6048.0          3350.0   
32    Mid-Atlantic              New York      39827.0         52070.0   
33  South Atlantic        North Carolina       6451.0          2817.0   
38    Mid-Atlantic          Pennsylvania       8163.0          5349.0   
40  South Atlantic        South Carolina       3082.0           851.0   
46  South Atlantic              Virginia       3928.0          2047.0   
48  South Atlantic         West Virginia       1021.0           222.0   

    state_pop  
7      965479  
8      701547  
9 

## New Columns

Also known as mutating, transforming, or feature engineering.

    object["newField"] = object["originalField"] / 100

"newField" is added to object as a new column holding the values of originalField divided by 100.

Multiple manipulations

- First get the skinny dogs
    bmi_lt_100 = dogs[dogs["bmi"] < 100]

- Next sort the result descending by height
    bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)

- Keep the columns you are interested in
    bmi_lt_100_height[["name", "height_cm", "bmi"]]
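
The steps above can be run end to end on a small example. The dogs values below are made up for illustration, and the bmi formula (weight in kilograms divided by height in metres squared) is an assumption about how that column was derived.

```python
import pandas as pd

# Made-up example data, not taken from the original notes
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
        "height_cm": [56, 43, 46, 49, 59],
        "weight_kg": [24, 24, 24, 17, 29],
    }
)

# New columns: height in metres, then body mass index
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Subset rows, sort, then keep the columns of interest
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
result = bmi_lt_100_height[["name", "height_cm", "bmi"]]

print(result)
```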

### Instructions

#### Adding new columns
You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.

homelessness is available and pandas is loaded as pd.

Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.

Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals.

In [46]:
# Add total col as sum of individuals and family_members
homelessness["total"]  = homelessness["individuals"] + homelessness["family_members"]

# Add p_individuals col as proportion of individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# See the result
print(homelessness)

                region                 state  individuals  family_members  \
0   East South Central               Alabama       2570.0           864.0   
1              Pacific                Alaska       1434.0           582.0   
2             Mountain               Arizona       7259.0          2606.0   
3   West South Central              Arkansas       2280.0           432.0   
4              Pacific            California     109008.0         20964.0   
5             Mountain              Colorado       7607.0          3250.0   
6          New England           Connecticut       2280.0          1696.0   
7       South Atlantic              Delaware        708.0           374.0   
8       South Atlantic  District of Columbia       3770.0          3134.0   
9       South Atlantic               Florida      21443.0          9587.0   
10      South Atlantic               Georgia       6943.0          2556.0   
11             Pacific                Hawaii       4131.0          2399.0   

#### Combo-attack!
You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new pandas skills to find out.

Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state.

Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.

Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.

Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.

In [47]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)

                   state  indiv_per_10k
8   District of Columbia      53.738381
11                Hawaii      29.079406
4             California      27.623825
37                Oregon      26.636307
28                Nevada      23.314189
47            Washington      21.829195
32              New York      20.392363
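
The same four steps can also be written as one chain. The sketch below is an alternative phrasing, not the exercise's own solution: it uses .assign() and .loc with a lambda, and runs on four rows copied from the homelessness data so that it is self-contained. The name homelessness_sample is made up for this example.

```python
import pandas as pd

homelessness_sample = pd.DataFrame(
    {
        "state": ["Alabama", "California", "District of Columbia", "Hawaii"],
        "individuals": [2570.0, 109008.0, 3770.0, 4131.0],
        "state_pop": [4887681, 39461588, 701547, 1420593],
    }
)

result = (
    homelessness_sample
    # Add the per-10k column without modifying the original DataFrame
    .assign(indiv_per_10k=lambda d: 10000 * d["individuals"] / d["state_pop"])
    # Keep rows above 20 per 10k
    .loc[lambda d: d["indiv_per_10k"] > 20]
    # Highest rate first
    .sort_values("indiv_per_10k", ascending=False)
    # Keep only the two columns of interest
    .loc[:, ["state", "indiv_per_10k"]]
)

print(result)
```

Chaining avoids the intermediate variables, at the cost of being harder to inspect step by step; the step-by-step version is often easier while exploring.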


## Summary Statistics

- .median(), .mode()
- .min(), .max()
- .var(), .std()
- .sum()
- .quantile()

Oldest dog:
dogs["date_of_birth"].min()

Youngest dog:
dogs["date_of_birth"].max()  # returns '2018-02-27'

### agg()
The aggregate method, .agg(), lets you compute custom summary statistics.

def pct30(column):
    return column.quantile(0.3)
    
dogs["weight_kg"].agg(pct30)

#### Summaries on Multiple Columns
dogs[["weight_kg", "height_cm"]].agg(pct30)

#### Multiple Summaries
def pct40(column):
    return column.quantile(0.4)
    
dogs["weight_kg"].agg([pct30, pct40])
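
A runnable sketch of the .agg() patterns above. The weights Series holds made-up values, chosen so that the quantiles are easy to check by hand.

```python
import pandas as pd

# Made-up weights: the integers 0 to 10
weights = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], name="weight_kg")

def pct30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

def pct40(column):
    # 40th percentile of the column
    return column.quantile(0.4)

# One function: .agg() returns a single number
single = weights.agg(pct30)

# A list of functions: .agg() returns a Series labelled by function name
several = weights.agg([pct30, pct40])

print(single)
print(several)
```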

#### Cumulative Summary

dogs["weight_kg"].cumsum()

#### Other Cumulative Statistics

- .cummax()
- .cummin()
- .cumprod()

### Instructions

#### Mean and median
Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. 

For example, mean, median, minimum, maximum, and standard deviation are summary statistics. 

Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

sales is available and pandas is loaded as pd.

Explore your new DataFrame first by printing the first few rows of the sales DataFrame.

Print information about the columns in sales.

Print the mean of the weekly_sales column.

Print the median of the weekly_sales column.


In [48]:
import pandas as pd
sales = pd.read_csv("sales_subset.csv")


# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame
print(sales.info())

# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())

# Print the median of weekly_sales
print(sales["weekly_sales"].median())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------

#### Summarizing dates
Summary statistics can also be calculated on date columns that have values with the data type datetime64. 

Some summary statistics — like mean — don't make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers.

sales is available and pandas is loaded as pd.

Print the maximum of the date column.

Print the minimum of the date column.

In [49]:
# Print the maximum of the date column
print(sales["date"].max())

# Print the minimum of the date column
print(sales["date"].min())

2012-10-26
2010-02-05
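
One caveat, stated as an assumption about how the data was loaded: pd.read_csv() does not parse dates unless asked to, so a date column read with the call shown earlier would hold plain strings rather than datetime64 values. Minimum and maximum still give the right answer for yyyy-mm-dd strings, because they sort alphabetically in date order, but converting with pd.to_datetime() allows date arithmetic as well. A minimal sketch with three dates taken from the sales output:

```python
import pandas as pd

# Three dates from the sales data, stored as plain strings
date_strings = pd.Series(["2010-03-05", "2012-10-26", "2010-02-05"])

# Convert to datetime64 so pandas treats them as dates
dates = pd.to_datetime(date_strings)

earliest = dates.min()
latest = dates.max()

# Subtracting two timestamps gives a Timedelta covering the data's time range
time_range = latest - earliest

print(earliest.date(), latest.date(), time_range)
```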


#### Efficient summaries
While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

df['column'].agg(function)

In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. 
It's an alternative to standard deviation that is helpful if your data contains outliers.

sales is available and pandas is loaded as pd.

Use the custom iqr function defined for you along with .agg() to print the IQR of the temperature_c column of sales.

Update the column selection to use the custom iqr function with .agg() to print the IQR of temperature_c, fuel_price_usd_per_l, and unemployment, in that order.

Update the aggregation functions called by .agg(): include iqr and np.median in that order.

In [50]:
import numpy as np
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))

16.583333333333336
temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64
        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


#### Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. 

In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

A DataFrame called sales_1_1 has been created for you, which contains the sales data for department 1 of store 1. pandas is loaded as pd.

Sort the rows of sales_1_1 by the date column in ascending order.

Get the cumulative sum of weekly_sales and add it as a new column of sales_1_1 called cum_weekly_sales.

Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales.

Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns.

In [51]:
sales_1_1 = sales[(sales["department"] == 1) & (sales["store"] == 1)]

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date")

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])


          date  weekly_sales  cum_weekly_sales  cum_max_sales
0   2010-02-05      24924.50          24924.50       24924.50
1   2010-03-05      21827.90          46752.40       24924.50
2   2010-04-02      57258.43         104010.83       57258.43
3   2010-05-07      17413.94         121424.77       57258.43
4   2010-06-04      17558.09         138982.86       57258.43
5   2010-07-02      16333.14         155316.00       57258.43
6   2010-08-06      17508.41         172824.41       57258.43
7   2010-09-03      16241.78         189066.19       57258.43
8   2010-10-01      20094.19         209160.38       57258.43
9   2010-11-05      34238.88         243399.26       57258.43
10  2010-12-03      22517.56         265916.82       57258.43
11  2011-01-07      15984.24         281901.06       57258.43


## Counting

### Dropping duplicate names

vet_visits.drop_duplicates(subset="name")

### Dropping duplicate pairs

unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

print(unique_dogs)

### Counts by field
- Count each value (sorted by count, largest first, by default)
    unique_dogs["breed"].value_counts()
- Sorted, stated explicitly
    unique_dogs["breed"].value_counts(sort=True)
- Proportion of the whole
    unique_dogs["breed"].value_counts(normalize=True)
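
A small end-to-end sketch of dropping duplicates and then counting. The vet_visits records below are made up for illustration.

```python
import pandas as pd

# Made-up vet visit records: Max the Labrador appears twice,
# and a different dog named Max is a Chow Chow
vet_visits = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Max", "Max", "Lucy", "Cooper"],
        "breed": ["Labrador", "Labrador", "Labrador", "Chow Chow", "Chow Chow", "Labrador"],
    }
)

# Keep the first row for each distinct (name, breed) pair
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

# Counts per breed, largest first by default
breed_counts = unique_dogs["breed"].value_counts()

# Proportions instead of counts
breed_props = unique_dogs["breed"].value_counts(normalize=True)

print(breed_counts)
print(breed_props)
```

Dropping duplicates on name alone would merge the two dogs called Max, which is why the pair of columns is used here.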

### Instructions

#### Dropping duplicates
Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. 

In this exercise, you'll create some new DataFrames using unique values from sales.

sales is available and pandas is imported as pd.

Remove rows of sales with duplicate pairs of store and type and save as store_types and print the head.

Remove rows of sales with duplicate pairs of store and department and save as store_depts and print the head.

Subset the rows that are holiday weeks using the is_holiday column, and drop the duplicate dates, saving as holiday_dates.

Select the date column of holiday_dates, and print.

In [52]:
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset = ["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset = ["store", "department"])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset = "date")

# Print date col of holiday_dates
print(holiday_dates["date"])

      Unnamed: 0  store type  department        date  weekly_sales  \
0              0      1    A           1  2010-02-05      24924.50   
901          901      2    A           1  2010-02-05      35034.06   
1798        1798      4    A           1  2010-02-05      38724.42   
2699        2699      6    A           1  2010-02-05      25619.00   
3593        3593     10    B           1  2010-02-05      40212.84   

      is_holiday  temperature_c  fuel_price_usd_per_l  unemployment  
0          False       5.727778              0.679451         8.106  
901        False       4.550000              0.679451         8.324  
1798       False       6.533333              0.686319         8.623  
2699       False       4.683333              0.679451         7.259  
3593       False      12.411111              0.782478         9.765  
    Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0            0      1    A           1  2010-02-05      24924.50       False   

#### Counting categorical variables
Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. 

In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

    # Drop duplicate store/type combinations
    store_types = sales.drop_duplicates(subset=["store", "type"])

    # Drop duplicate store/department combinations
    store_depts = sales.drop_duplicates(subset=["store", "department"])

The store_types and store_depts DataFrames you created in the last exercise are available, and pandas is imported as pd.

Count the number of stores of each store type in store_types.

Count the proportion of stores of each store type in store_types.

Count the number of different departments in store_depts, sorting the counts in descending order.

Count the proportion of different departments in store_depts, sorting the proportions in descending order.

In [53]:
store_types

Unnamed: 0.1,Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
901,901,2,A,1,2010-02-05,35034.06,False,4.55,0.679451,8.324
1798,1798,4,A,1,2010-02-05,38724.42,False,6.533333,0.686319,8.623
2699,2699,6,A,1,2010-02-05,25619.0,False,4.683333,0.679451,7.259
3593,3593,10,B,1,2010-02-05,40212.84,False,12.411111,0.782478,9.765
4495,4495,13,A,1,2010-02-05,46761.9,False,-0.261111,0.704283,8.316
5408,5408,14,A,1,2010-02-05,32842.31,False,-2.605556,0.735455,8.992
6293,6293,19,A,1,2010-02-05,21500.58,False,-6.133333,0.780365,8.35
7199,7199,20,A,1,2010-02-05,46021.21,False,-3.377778,0.735455,8.187
8109,8109,27,A,1,2010-02-05,32313.79,False,-2.672222,0.780365,8.237


In [54]:
# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize = True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort = True)
print(dept_counts_sorted) 

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

A    11
B     1
Name: type, dtype: int64
A    0.916667
B    0.083333
Name: type, dtype: float64
1     12
55    12
72    12
71    12
67    12
      ..
37    10
48     8
50     6
39     4
43     2
Name: department, Length: 80, dtype: int64
1     0.012917
55    0.012917
72    0.012917
71    0.012917
67    0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: department, Length: 80, dtype: float64


## Grouped Summary Statistics

### Summaries by Group

    dogs[dogs["color"] == "Black"]["weight_kg"].mean()
    dogs[dogs["color"] == "Brown"]["weight_kg"].mean()

or

    dogs.groupby("color")["weight_kg"].mean()
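
To make the equivalence concrete, here is a minimal sketch on a hypothetical four-row `dogs` table (the values are invented for illustration). It computes the per-color means once by filtering manually and once with `.groupby()`.

```python
import pandas as pd

# Hypothetical toy data; not the course's dogs dataset
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0],
})

# Manual approach: one filter and one mean per color
black_mean = dogs[dogs["color"] == "Black"]["weight_kg"].mean()
brown_mean = dogs[dogs["color"] == "Brown"]["weight_kg"].mean()

# Grouped approach: one call returns a Series indexed by color
by_color = dogs.groupby("color")["weight_kg"].mean()
print(by_color)
```

The grouped version scales to any number of colors without writing a new filter for each one.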

### Multiple Grouped Summaries
    dogs.groupby("color")["weight_kg"].agg([min, max, sum])
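
A hedged variant of the call above: passing the statistic names as strings gives the same kind of result and, as far as I know, avoids the warnings that some newer pandas versions emit when built-in functions such as `min` are passed to `.agg()`. The `dogs` table is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; not the course's dogs dataset
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0],
})

# One row per color, one column per requested statistic
stats = dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"])
print(stats)
```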

### Grouping by Multiple Variables
    dogs.groupby(["color", "breed"])["weight_kg"].mean()


### Many Groups, Many Summaries
    dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
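
The following sketch shows what grouping by two variables and summarizing two columns returns, again on a hypothetical `dogs` table. The result has a two-level row index (color, breed), so individual cells are selected with a tuple.

```python
import pandas as pd

# Hypothetical toy data; not the course's dogs dataset
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black"],
    "breed": ["Labrador", "Labrador", "Poodle", "Labrador"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0],
    "height_cm": [56.0, 60.0, 44.0, 58.0],
})

# Mean of both columns for every observed color/breed combination
means = dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
print(means)

# Select one cell with a (color, breed) tuple and a column name
black_lab_weight = means.loc[("Black", "Labrador"), "weight_kg"]
```

Only combinations that occur in the data appear as rows, so there is no Black/Poodle row here.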

### Instructions

#### What percent of sales occurred at each store type?
While .groupby() is useful, you can calculate grouped summary statistics without it.

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." 

In this exercise, you'll calculate the total sales made at each store type, without using .groupby(). 

You can then use these numbers to see what proportion of Walmart's total sales were made at each type.

sales is available and pandas is imported as pd.

Calculate the total weekly_sales over the whole dataset.

Subset for type "A" stores, and calculate their total weekly sales.

Do the same for type "B" and type "C" stores.

Combine the A/B/C results into a list, and divide by sales_all to get the proportion of sales by type.

In [59]:
# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C]/sales_all
print(sales_propn_by_type)

array([0.9097747, 0.0902253, 0.       ])
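
A note on the last step: dividing a plain Python list by a number normally raises a TypeError, and the division above appears to work only because `.sum()` returns a NumPy scalar. The sketch below, on hypothetical toy data, makes that explicit by building a NumPy array first. It also shows why a type with no rows contributes a proportion of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; there are deliberately no type "C" rows
sales = pd.DataFrame({
    "type": ["A", "A", "B"],
    "weekly_sales": [100.0, 300.0, 100.0],
})

sales_all = sales["weekly_sales"].sum()

# Total sales per type; an empty subset sums to 0
totals = np.array([
    sales[sales["type"] == t]["weekly_sales"].sum() for t in ["A", "B", "C"]
])

# Element-wise division gives the proportion for each type
sales_propn_by_type = totals / sales_all
print(sales_propn_by_type)
```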

#### Calculations with .groupby()
The .groupby() method makes life much easier. 

In this exercise, you'll perform the same calculations as last time, except you'll use the .groupby() method. 

You'll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it's a holiday week or not.

sales is available and pandas is loaded as pd.

Group sales by "type", take the sum of "weekly_sales", and store as sales_by_type.

Calculate the proportion of sales at each store type by dividing by the sum of sales_by_type. Assign to sales_propn_by_type.

Group sales by "type" and "is_holiday", take the sum of weekly_sales, and store as sales_by_type_is_holiday.

In [63]:
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type/sum(sales_by_type)
print(sales_propn_by_type)

# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()
print(sales_by_type_is_holiday)

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64


type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64

#### Multiple grouped summaries
Earlier in this chapter, you saw that the .agg() method is useful to compute multiple statistics on multiple variables.

It also works with grouped data. 

NumPy, which is imported as np, has many different summary statistics functions, including: np.min, np.max, np.mean, and np.median.

sales is available and pandas is imported as pd.

Import numpy with the alias np.

Get the min, max, mean, and median of weekly_sales for each store type using .groupby() and .agg(). Store this as sales_stats. Make sure to use numpy functions!

Get the min, max, mean, and median of unemployment and fuel_price_usd_per_l for each store type. Store this as unemp_fuel_stats.

In [72]:
# Import numpy with the alias np
import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg([np.min, np.max, np.mean, np.median])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([np.min, np.max, np.mean, np.median])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

| type | unemployment: amin | unemployment: amax | unemployment: mean | unemployment: median | fuel_price_usd_per_l: amin | fuel_price_usd_per_l: amax | fuel_price_usd_per_l: mean | fuel_price_usd_per_l: median |
|---|---|---|---|---|---|---|---|---|
| A | 3.879 | 8.992 | 7.972611 | 8.067 | 0.664129 | 1.10741 | 0.744619 | 0.735455 |
| B | 7.17 | 9.765 | 9.279323 | 9.199 | 0.760023 | 1.107674 | 0.805858 | 0.803348 |


## Pivot Tables

### Group by v Pivot Table

    dogs.groupby("color")["weight_kg"].mean()

    dogs.pivot_table(values="weight_kg", index="color")

- values is the column to summarize
- index is the column to group by
- The default aggregation is the mean
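
A minimal sketch comparing the two calls on a hypothetical `dogs` table: the numbers match, but `.groupby()` returns a Series while `.pivot_table()` returns a DataFrame with one column named after `values`.

```python
import pandas as pd

# Hypothetical toy data; not the course's dogs dataset
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0],
})

by_group = dogs.groupby("color")["weight_kg"].mean()            # Series
by_pivot = dogs.pivot_table(values="weight_kg", index="color")  # DataFrame

print(by_group)
print(by_pivot)
```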
  
### Different Statistics
- A different aggregation can be requested by passing the aggfunc argument
  - dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

### Multiple Statistics
- dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])
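
The sketch below tries both forms of `aggfunc` on a hypothetical `dogs` table. It passes the statistics as strings rather than NumPy functions, which is my substitution and not what the lesson uses; the results should be the same. With a list, the columns become pairs of (statistic, value column).

```python
import pandas as pd

# Hypothetical toy data; not the course's dogs dataset
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black", "Black"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0, 20.0],
})

# A single statistic other than the default mean
med = dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")

# Several statistics at once; columns are (statistic, value column) pairs
both = dogs.pivot_table(values="weight_kg", index="color",
                        aggfunc=["mean", "median"])
print(both)

black_mean = both.loc["Black", ("mean", "weight_kg")]
black_median = both.loc["Black", ("median", "weight_kg")]
```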

### Pivot on Two Variables
- Use the columns argument to pass the second field
  - dogs.pivot_table(values="weight_kg", index="color", columns="breed")
  - The lesson says this is the same as grouping by both columns, because the resulting numbers are the same
    - but only if we ignore the wholly different data structure (a wide table instead of a two-level index) and all the NaNs for color/breed combinations with no data
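
One way to see both the similarity and the difference is to compare the two results directly. In this sketch (hypothetical data), calling `.unstack()` on the grouped result moves the breed level into the columns, which produces the same wide layout as the pivot table, NaN included.

```python
import pandas as pd

# Hypothetical toy data; there is no Black Poodle, so one cell is empty
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black"],
    "breed": ["Labrador", "Labrador", "Poodle", "Labrador"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0],
})

# Wide layout: colors as rows, breeds as columns, NaN where no data exists
wide = dogs.pivot_table(values="weight_kg", index="color", columns="breed")

# Long layout: a Series with a two-level (color, breed) index, no NaN rows
long = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# Moving the breed level into the columns reproduces the wide layout
unstacked = long.unstack()
print(wide)
print(long)
```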

### Remove Missing Values
- Use the fill_value argument to replace the NaNs with a value of your choice
  - dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

### Set Margins to Get Summary Statistics

- Set the margins argument to True and the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with 0s.
  - dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
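
The claim that the filled-in 0s are left out of the margins can be checked on a small hypothetical table. Here the Black/Poodle cell is empty and gets filled with 0, yet the "All" value for Black is still the mean of the two real Black weights.

```python
import pandas as pd

# Hypothetical toy data; there is no Black Poodle
dogs = pd.DataFrame({
    "color": ["Brown", "Black", "Brown", "Black"],
    "breed": ["Labrador", "Labrador", "Poodle", "Labrador"],
    "weight_kg": [24.0, 30.0, 26.0, 22.0],
})

table = dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                         fill_value=0, margins=True)
print(table)

black_poodle = table.loc["Black", "Poodle"]  # empty cell, filled with 0
black_all = table.loc["Black", "All"]        # mean of 30 and 22 only
grand_mean = table.loc["All", "All"]         # mean of all four weights
```

If the filled 0 were included, the Black margin would be 13 instead of 26.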

### Instructions

#### Pivoting on one variable
Pivot tables are the standard way of aggregating data in spreadsheets. <br>
In pandas, pivot tables are essentially just another way of performing grouped calculations. <br>

In this exercise, you'll perform calculations using .pivot_table() to replicate the calculations you performed in the last lesson using .groupby().

Get the mean weekly_sales by type using .pivot_table() and store as mean_sales_by_type.

Get the mean and median (using NumPy functions) of weekly_sales by type using .pivot_table() and store as mean_med_sales_by_type.

Get the mean of weekly_sales by type and is_holiday using .pivot_table() and store as mean_sales_by_type_holiday.

In [76]:
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values = "weekly_sales", index = "type")

# Print mean_sales_by_type
print(mean_sales_by_type)

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values = "weekly_sales", index = "type", aggfunc = [np.mean, np.median])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)

# Pivot for mean weekly_sales by store type and holiday 
mean_sales_by_type_holiday = sales.pivot_table(values = "weekly_sales", index = "type", columns = "is_holiday" )

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

      weekly_sales
type              
A     23674.667242
B     25696.678370


| type | is_holiday False | is_holiday True |
|---|---|---|
| A | 23768.583523 | 590.04525 |
| B | 25751.980533 | 810.705 |


#### Fill in missing values and sum values with pivot tables
The .pivot_table() method has several useful arguments, including fill_value and margins.

- fill_value replaces missing values with a real value (known as imputation).
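- margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it adds an "All" row and column that summarize the pivot table contents using the same aggregation function (the mean by default, so these are totals only when aggfunc is a sum).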
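
To illustrate how `margins` follows the aggregation function, this sketch builds the same pivot twice on hypothetical sales data: once with the default mean and once with `aggfunc="sum"`. Only the second version yields true row and column totals.

```python
import pandas as pd

# Hypothetical toy data; department 2 has no type "B" store
sales = pd.DataFrame({
    "department": [1, 1, 2, 2],
    "type": ["A", "B", "A", "A"],
    "weekly_sales": [10.0, 20.0, 30.0, 50.0],
})

# Default aggregation: cells and margins are means
means = sales.pivot_table(values="weekly_sales", index="department",
                          columns="type", fill_value=0, margins=True)

# Sum aggregation: cells and margins are totals
totals = sales.pivot_table(values="weekly_sales", index="department",
                           columns="type", aggfunc="sum",
                           fill_value=0, margins=True)
print(means)
print(totals)
```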


Print the mean weekly_sales by department and type, filling in any missing values with 0.

Print the mean weekly_sales by department and type, filling in any missing values with 0 and summing all rows and columns.

In [79]:
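# Mean weekly_sales by department and type; fill missing values with 0
sales.pivot_table(values = "weekly_sales", index = "department", columns = "type", fill_value = 0)

# Same table, plus an "All" row and column (margins)
sales.pivot_table(values = "weekly_sales", index = "department", columns = "type", fill_value = 0, margins = True)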

| department | type A | type B | All |
|---|---|---|---|
| 1 | 30961.725379 | 44050.626667 | 32052.467153 |
| 2 | 67600.158788 | 112958.526667 | 71380.022778 |
| 3 | 17160.002955 | 30580.655000 | 18278.390625 |
| 4 | 44285.399091 | 51219.654167 | 44863.253681 |
| 5 | 34821.011364 | 63236.875000 | 37189.000000 |
| ... | ... | ... | ... |
| 96 | 21367.042857 | 9528.538333 | 20337.607681 |
| 97 | 28471.266970 | 5828.873333 | 26584.400833 |
| 98 | 12875.423182 | 217.428333 | 11820.590278 |
| 99 | 379.123659 | 0.000000 | 379.123659 |
