# Data Manipulation with pandas

# Chapter 1: DataFrames and Transforming Data

### Pandas

pandas is a Python package for data manipulation and visualization.

It is built on top of NumPy and Matplotlib.

It is designed to work with `rectangular data`, also called `tabular data`.

In pandas, rectangular data is represented as a `DataFrame`.
 
### Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do `is explore it and see what it contains`. 

There are several `useful methods and attributes` for this.

`.head()` returns the first few rows (the “head” of the DataFrame).
    
`.info()` shows information on each of the columns, such as the data type and number of missing values.
    
`.shape` returns the number of rows and columns of the DataFrame.
    
`.describe()` calculates a few summary statistics for each column.

#### `homelessness` - DataFrame
    
homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018.

The `individuals` column is the number of homeless individuals who are not part of a family with children.

The `family_members` column is the number of homeless individuals who are part of a family with children.

The `state_pop` column is the state's total population.

In [214]:
import pandas as pd

In [215]:
homelessness = pd.read_csv('homelessness.csv')

In [216]:
# Print the head of the homelessness data
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570,864,4887681
1,Pacific,Alaska,1434,582,735139
2,Mountain,Arizona,7259,2606,7158024
3,West South Central,Arkansas,2280,432,3009733
4,Pacific,California,109008,20964,39461588


In [217]:
# Print information about homelessness
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
region            51 non-null object
state             51 non-null object
individuals       51 non-null int64
family_members    51 non-null int64
state_pop         51 non-null int64
dtypes: int64(3), object(2)
memory usage: 2.1+ KB


In [218]:
# Print the shape of homelessness
homelessness.shape

(51, 5)

In [219]:
# Print a description of homelessness
homelessness.describe()

Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0
mean,7225.784314,3500.960784,6405637.0
std,15991.025083,7806.911612,7327258.0
min,434.0,75.0,577601.0
25%,1446.5,550.5,1777414.0
50%,3082.0,1482.0,4461153.0
75%,6781.5,3196.0,7340946.0
max,109008.0,52070.0,39461590.0


### Parts of a DataFrame

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

`.values`: A two-dimensional NumPy array of values.

`.columns`: An index of columns: the column names.

`.index`: An index for the rows: either row numbers or row names.

You can usually think of `indexes` as being like a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)

`homelessness` is available.

In [220]:
# Import pandas using the alias pd
import pandas as pd
homelessness.head(2)

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570,864,4887681
1,Pacific,Alaska,1434,582,735139


In [221]:
# Print the values of homelessness
homelessness.values[:2]

array([['East South Central', 'Alabama', 2570, 864, 4887681],
       ['Pacific', 'Alaska', 1434, 582, 735139]], dtype=object)

In [222]:
# Print the column index of homelessness
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

In [223]:
# Print the row index of homelessness
homelessness.index

RangeIndex(start=0, stop=51, step=1)

## 1.1 Sorting and subsetting

### Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

    Sort on             Syntax

    one column          df.sort_values("breed")
    multiple columns    df.sort_values(["breed", "weight_kg"])
    
By combining `.sort_values()` with `.head()`, you can answer questions in the form, "What are the top cases where...?".
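
As a minimal sketch of the "top cases" pattern, the snippet below rebuilds five rows from the `homelessness` output shown earlier (the frame name `sample` is only for illustration) and asks for the three states with the most homeless family members. Passing `ascending=False` is what makes the sort descending; without it, `.sort_values()` sorts from smallest to largest.

```python
import pandas as pd

# Five rows copied from the homelessness output shown earlier in this chapter
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570, 1434, 7259, 2280, 109008],
    "family_members": [864, 582, 2606, 432, 20964],
})

# ascending=False puts the largest values first; head(3) keeps the top three rows
top_family = sample.sort_values("family_members", ascending=False).head(3)
print(top_family)
```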

`homelessness` is available and pandas is imported as `pd`.

In [224]:
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values('individuals')

# Print the top few rows
homelessness_ind.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434,205,577601
34,West North Central,North Dakota,467,75,758080
7,South Atlantic,Delaware,708,374,965479
39,New England,Rhode Island,747,354,1058287
45,New England,Vermont,780,511,624358


In [225]:
# Look at the unsorted data again before sorting by family members
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570,864,4887681
1,Pacific,Alaska,1434,582,735139
2,Mountain,Arizona,7259,2606,7158024
3,West South Central,Arkansas,2280,432,3009733
4,Pacific,California,109008,20964,39461588


In [228]:
# Sort homelessness by family members (ascending by default; pass ascending=False for descending)
homelessness_fam = homelessness.sort_values('family_members')

# Print the top few rows
homelessness_fam.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
34,West North Central,North Dakota,467,75,758080
50,Mountain,Wyoming,434,205,577601
48,South Atlantic,West Virginia,1021,222,1804291
41,West North Central,South Dakota,836,323,878698
24,East South Central,Mississippi,1024,328,2981020


In [229]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'],ascending = [True, False])

# Print the top few rows
homelessness_reg_fam.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
13,East North Central,Illinois,6752,3891,12723071
35,East North Central,Ohio,6929,3320,11676341
22,East North Central,Michigan,5209,3142,9984072
49,East North Central,Wisconsin,2740,2167,5807406
14,East North Central,Indiana,3776,1482,6695497


### Subsetting columns

When working with data, you may not need all of the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you, in an order that makes sense to you.

To select only `"col_a"` of the DataFrame `df`, use

    df["col_a"]
    
To select `"col_a"` and `"col_b"` of `df`, use

    df[["col_a", "col_b"]]
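
The two forms return different types, which matters for what you can do next. The sketch below uses a small made-up frame (`df` and its column names are placeholders, not part of the homelessness data) to show that single brackets give a Series, a list of names gives a DataFrame, and the columns come back in the order listed.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4], "col_c": [5, 6]})

one_col = df["col_a"]                # single brackets: a pandas Series
one_col_frame = df[["col_a"]]        # list with one name: a DataFrame with one column
reordered = df[["col_c", "col_a"]]   # columns are returned in the order listed

print(type(one_col).__name__, type(one_col_frame).__name__)
print(list(reordered.columns))
```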

In [231]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570,864,4887681
1,Pacific,Alaska,1434,582,735139
2,Mountain,Arizona,7259,2606,7158024
3,West South Central,Arkansas,2280,432,3009733
4,Pacific,California,109008,20964,39461588


In [230]:
# Select the individuals column
individuals = homelessness["individuals"]

# Print top 5 of the result
individuals.head()  # The result is a pandas Series

0      2570
1      1434
2      7259
3      2280
4    109008
Name: individuals, dtype: int64

In [235]:
# Select the state and family_members columns
# pass a list of column names

state_fam = homelessness[["state","family_members"]]

# Print the head of the result
state_fam.head()

Unnamed: 0,state,family_members
0,Alabama,864
1,Alaska,582
2,Arizona,2606
3,Arkansas,432
4,California,20964


In [234]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals","state" ]]

# Print the head of the result
ind_state.head()

Unnamed: 0,individuals,state
0,2570,Alabama
1,1434,Alaska
2,7259,Arizona
3,2280,Arkansas
4,109008,California


### Subsetting rows

A large part of data science is about finding which bits of your dataset are **interesting**. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return `True` or `False` for each row, then pass that inside square brackets.

    dogs[dogs["height_cm"] > 60]
    dogs[dogs["color"] == "tan"]
    
You can filter for multiple conditions at once by using the `"logical and"` operator, `&`.

    dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
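
The parentheses around each condition are needed because `&` and `|` bind more tightly than comparison operators such as `>` and `==` in Python. One way to keep this readable is to name each boolean Series first. The sketch below uses an invented `dogs` frame (the names and values are hypothetical).

```python
import pandas as pd

# Hypothetical dogs data, invented for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "height_cm": [56, 43, 46, 77],
    "color": ["brown", "black", "tan", "tan"],
})

is_tall = dogs["height_cm"] > 60   # boolean Series, one True/False per row
is_tan = dogs["color"] == "tan"

tall_and_tan = dogs[is_tall & is_tan]                        # both conditions hold
tall_or_black = dogs[is_tall | (dogs["color"] == "black")]   # either condition holds

print(tall_and_tan)
print(tall_or_black)
```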
    

In [246]:
# Which states have more than 10,000 homeless individuals, sorted by individuals?
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

# See the result
ind_gt_10k.sort_values("individuals",ascending = False)

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008,20964,39461588
32,Mid-Atlantic,New York,39827,52070,19530351
9,South Atlantic,Florida,21443,9587,21244317
43,West South Central,Texas,19199,6111,28628666
47,Pacific,Washington,16424,5880,7523869
37,Pacific,Oregon,11139,3337,4181886


In [50]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# See the result
mountain_reg

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259,2606,7158024
5,Mountain,Colorado,7607,3250,5691287
12,Mountain,Idaho,1297,715,1750536
26,Mountain,Montana,983,422,1060665
28,Mountain,Nevada,7058,486,3027341
31,Mountain,New Mexico,1949,602,2092741
44,Mountain,Utah,1904,972,3153550
50,Mountain,Wyoming,434,205,577601


In [53]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] <1000) & (homelessness["region"] == "Pacific") ]

# See the result
fam_lt_1k_pac

Unnamed: 0,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434,582,735139


### Subsetting rows by categorical variables: `.isin()` method

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. **This can get tedious** when you want all states in one of three different regions, for example. **Instead**, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

    colors = ["brown", "black", "tan"]
    condition = dogs["color"].isin(colors)
    dogs[condition]

In [58]:
# Using the "or" operator |
south_mid_atlantic = homelessness[(homelessness["region"] == "South Atlantic") | (homelessness["region"] == "Mid-Atlantic") ]
south_mid_atlantic

Unnamed: 0,region,state,individuals,family_members,state_pop
7,South Atlantic,Delaware,708,374,965479
8,South Atlantic,District of Columbia,3770,3134,701547
9,South Atlantic,Florida,21443,9587,21244317
10,South Atlantic,Georgia,6943,2556,10511131
20,South Atlantic,Maryland,4914,2230,6035802
30,Mid-Atlantic,New Jersey,6048,3350,8886025
32,Mid-Atlantic,New York,39827,52070,19530351
33,South Atlantic,North Carolina,6451,2817,10381615
38,Mid-Atlantic,Pennsylvania,8163,5349,12800922
40,South Atlantic,South Carolina,3082,851,5084156


In [64]:
#using .isin() method
# Subset for rows in South Atlantic or Mid-Atlantic regions
areas = ["South Atlantic", "Mid-Atlantic" ]
condition = homelessness["region"].isin(areas)
south_mid_atlantic = homelessness[condition]

# See the result
south_mid_atlantic

Unnamed: 0,region,state,individuals,family_members,state_pop
7,South Atlantic,Delaware,708,374,965479
8,South Atlantic,District of Columbia,3770,3134,701547
9,South Atlantic,Florida,21443,9587,21244317
10,South Atlantic,Georgia,6943,2556,10511131
20,South Atlantic,Maryland,4914,2230,6035802
30,Mid-Atlantic,New Jersey,6048,3350,8886025
32,Mid-Atlantic,New York,39827,52070,19530351
33,South Atlantic,North Carolina,6451,2817,10381615
38,Mid-Atlantic,Pennsylvania,8163,5349,12800922
40,South Atlantic,South Carolina,3082,851,5084156


In [66]:
# Make it more compact by passing the list directly to .isin()
# Subset for rows in South Atlantic or Mid-Atlantic regions
 
south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic", "Mid-Atlantic" ])]

# See the result
south_mid_atlantic

Unnamed: 0,region,state,individuals,family_members,state_pop
7,South Atlantic,Delaware,708,374,965479
8,South Atlantic,District of Columbia,3770,3134,701547
9,South Atlantic,Florida,21443,9587,21244317
10,South Atlantic,Georgia,6943,2556,10511131
20,South Atlantic,Maryland,4914,2230,6035802
30,Mid-Atlantic,New Jersey,6048,3350,8886025
32,Mid-Atlantic,New York,39827,52070,19530351
33,South Atlantic,North Carolina,6451,2817,10381615
38,Mid-Atlantic,Pennsylvania,8163,5349,12800922
40,South Atlantic,South Carolina,3082,851,5084156


In [67]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
mojave_homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259,2606,7158024
4,Pacific,California,109008,20964,39461588
28,Mountain,Nevada,7058,486,3027341
44,Mountain,Utah,1904,972,3153550


## 1.2 Adding new columns

You aren't stuck with just the data you are given. Instead, **you can add new columns** to a DataFrame. This has many names, such as `transforming`, `mutating`, and `feature engineering`.

You can create new columns **from scratch**, but it is **also common to derive them from other columns**, for example, by adding columns together, or by changing their units.
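
A unit change is the simplest kind of derived column. As a small sketch, the snippet below takes two `state_pop` values from the output shown earlier and adds the same population expressed in millions; the column name `state_pop_millions` is an illustrative choice, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the homelessness output shown earlier
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "state_pop": [4887681, 735139],
})

# Derive a new column by changing units: people -> millions of people
sample["state_pop_millions"] = sample["state_pop"] / 1_000_000
print(sample)
```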


In [76]:
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

# See the result
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals
0,East South Central,Alabama,2570,864,4887681,3434,0.748398
1,Pacific,Alaska,1434,582,735139,2016,0.71131
2,Mountain,Arizona,7259,2606,7158024,9865,0.735834
3,West South Central,Arkansas,2280,432,3009733,2712,0.840708
4,Pacific,California,109008,20964,39461588,129972,0.838704


### Combo-attack!

You've seen `the four most common types of data manipulation`: 
    
    1. sorting rows,
    2. subsetting columns,
    3. subsetting rows, and
    4. adding new columns.

In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" `Combine your new pandas skills` to find out.
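
The four manipulations can also be chained into one expression. The sketch below is one possible way to do that on five rows copied from the `homelessness` output (the frame name `sample` is illustrative); the cells that follow take the same steps one at a time on the full data.

```python
import pandas as pd

# Five rows copied from the homelessness output shown in this chapter
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska", "California", "District of Columbia", "Hawaii"],
    "individuals": [2570, 1434, 109008, 3770, 4131],
    "state_pop": [4887681, 735139, 39461588, 701547, 1420593],
})

# Add a new column: homeless individuals per 10,000 residents
sample["indiv_per_10k"] = 10000 * sample["individuals"] / sample["state_pop"]

result = (
    sample[sample["indiv_per_10k"] > 20]              # subset rows
    .sort_values("indiv_per_10k", ascending=False)    # sort rows, highest rate first
    [["state", "indiv_per_10k"]]                      # subset columns
)
print(result)
```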

In [77]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals
0,East South Central,Alabama,2570,864,4887681,3434,0.748398
1,Pacific,Alaska,1434,582,735139,2016,0.71131
2,Mountain,Arizona,7259,2606,7158024,9865,0.735834
3,West South Central,Arkansas,2280,432,3009733,2712,0.840708
4,Pacific,California,109008,20964,39461588,129972,0.838704


In [88]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals,indiv_per_10k
0,East South Central,Alabama,2570,864,4887681,3434,0.748398,5.258117
1,Pacific,Alaska,1434,582,735139,2016,0.71131,19.506515
2,Mountain,Arizona,7259,2606,7158024,9865,0.735834,10.141067
3,West South Central,Arkansas,2280,432,3009733,2712,0.840708,7.575423
4,Pacific,California,109008,20964,39461588,129972,0.838704,27.623825


In [89]:

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
high_homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals,indiv_per_10k
4,Pacific,California,109008,20964,39461588,129972,0.838704,27.623825
8,South Atlantic,District of Columbia,3770,3134,701547,6904,0.54606,53.738381
11,Pacific,Hawaii,4131,2399,1420593,6530,0.632619,29.079406
28,Mountain,Nevada,7058,486,3027341,7544,0.935578,23.314189
32,Mid-Atlantic,New York,39827,52070,19530351,91897,0.433387,20.392363
37,Pacific,Oregon,11139,3337,4181886,14476,0.769481,26.636307
47,Pacific,Washington,16424,5880,7523869,22304,0.73637,21.829195


In [90]:
# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
high_homelessness_srt

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals,indiv_per_10k
8,South Atlantic,District of Columbia,3770,3134,701547,6904,0.54606,53.738381
11,Pacific,Hawaii,4131,2399,1420593,6530,0.632619,29.079406
4,Pacific,California,109008,20964,39461588,129972,0.838704,27.623825
37,Pacific,Oregon,11139,3337,4181886,14476,0.769481,26.636307
28,Mountain,Nevada,7058,486,3027341,7544,0.935578,23.314189
47,Pacific,Washington,16424,5880,7523869,22304,0.73637,21.829195
32,Mid-Atlantic,New York,39827,52070,19530351,91897,0.433387,20.392363


In [91]:
# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]
# See the result
result

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363


# Chapter 2: Aggregating Data

## 2.1 Summary statistics

#### Mean and median

`Summary statistics` are exactly what they sound like - they summarize `many numbers in one statistic`. 

For example, mean, median, minimum, maximum, and standard deviation are summary statistics.

Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

In [94]:
# Print the head of the homelessness DataFrame
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals,indiv_per_10k
0,East South Central,Alabama,2570,864,4887681,3434,0.748398,5.258117
1,Pacific,Alaska,1434,582,735139,2016,0.71131,19.506515
2,Mountain,Arizona,7259,2606,7158024,9865,0.735834,10.141067
3,West South Central,Arkansas,2280,432,3009733,2712,0.840708,7.575423
4,Pacific,California,109008,20964,39461588,129972,0.838704,27.623825


In [97]:
# Print the mean of homeless individuals
print(homelessness['individuals'].mean())

7225.78431372549


In [99]:

# Print the median of homeless individuals
print(homelessness['individuals'].median())

3082.0


### Summarizing dates

Summary statistics can also be calculated on date columns which have values with the data type `datetime64`. 

Some summary statistics — like `mean` — don't make a ton of sense on dates, but others are super helpful, for example `minimum and maximum`, which `allow you to see what time range` your data covers.
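
The `homelessness` data has no date column, so here is a minimal sketch on a small Series of made-up dates (the values are hypothetical). It shows `.min()` and `.max()` on datetime values and how subtracting them gives the time range covered.

```python
import pandas as pd

# Hypothetical survey dates, invented for illustration
dates = pd.Series(pd.to_datetime(["2018-01-15", "2018-06-30", "2018-03-01"]))

earliest = dates.min()        # earliest date in the column
latest = dates.max()          # latest date in the column
covered = latest - earliest   # a Timedelta describing the range

print(earliest, latest, covered)
```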

In [100]:
# homelessness has no date column, so print the maximum of the total column instead
print(homelessness["total"].max())

129972


In [102]:

# homelessness has no date column, so print the minimum of the total column instead
print(homelessness["total"].min())

542


### Efficient summaries

While pandas and NumPy have tons of functions, sometimes you may need a different function to summarize your data.

The `.agg()` method **allows you to apply your own custom functions** to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations **super efficient**.

In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.
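
As a rough sketch of the pattern, the snippet below applies a custom function together with built-in statistics to two columns at once. It uses five rows copied from the `homelessness` output and refers to the built-in statistics by their string names (`"median"`, `"min"`, `"max"`). This is an alternative to passing NumPy functions, which some newer pandas versions warn about; treat that as a version-dependent detail.

```python
import pandas as pd

def iqr(column):
    # Inter-quartile range: 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Five rows copied from the homelessness output shown earlier
sample = pd.DataFrame({
    "individuals": [2570, 1434, 7259, 2280, 109008],
    "family_members": [864, 582, 2606, 432, 20964],
})

# One row per statistic, one column per variable
stats = sample.agg([iqr, "median", "min", "max"])
print(stats)
```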

In [106]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the p_individuals column
print(homelessness['p_individuals'].agg(iqr))

0.1193842720534477


In [107]:
# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print the IQR of individuals, family_members, and state_pop
print(homelessness[["individuals", "family_members", "state_pop"]].agg(iqr))

individuals          5335.0
family_members       2645.5
state_pop         5563533.0
dtype: float64


In [130]:
# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print the IQR, median, mean, and min of individuals, family_members, and state_pop
print(homelessness[["individuals", "family_members", "state_pop"]].agg([iqr, np.median, np.mean, np.min]))

        individuals  family_members     state_pop
iqr     5335.000000     2645.500000  5.563533e+06
median  3082.000000     1482.000000  4.461153e+06
mean    7225.784314     3500.960784  6.405637e+06
amin     434.000000       75.000000  5.776010e+05


### Cumulative statistics

Cumulative statistics can also be helpful in tracking `summary statistics` `over time`. 

In this exercise, you'll calculate the cumulative sum and cumulative max.

In [131]:
# Sort homelessness by state
homelessness = homelessness.sort_values("state")

# Get the cumulative sum of homeless individuals, add as cum_individuals col
homelessness["cum_individuals"] = homelessness["individuals"].cumsum()

# Get the cumulative max of individuals, add as cum_max_individuals col
homelessness["cum_max_individuals"] = homelessness["individuals"].cummax()

# See the columns you calculated
homelessness[["state", "individuals", "cum_individuals", "cum_max_individuals"]].head(6)

Unnamed: 0,state,individuals,cum_individuals,cum_max_individuals
0,Alabama,2570,2570,2570
1,Alaska,1434,4004,2570
2,Arizona,7259,11263,7259
3,Arkansas,2280,13543,7259
4,California,109008,122551,109008
5,Colorado,7607,130158,109008


## 2.2 Counting

#### Dropping duplicates

Removing duplicates is an essential skill to get accurate counts, because often you don't want to count the same thing multiple times. 
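
What counts as a duplicate depends on which columns you compare. The sketch below uses an invented visit log (all names and values are hypothetical) to contrast dropping duplicates on one column with dropping duplicates on a combination of columns via `subset`.

```python
import pandas as pd

# Hypothetical visit log, invented for illustration
visits = pd.DataFrame({
    "name": ["Bella", "Bella", "Max", "Max", "Max"],
    "breed": ["Labrador", "Labrador", "Labrador", "Chow Chow", "Chow Chow"],
    "visit": [1, 2, 1, 1, 2],
})

# Keep the first row for each distinct name
unique_names = visits.drop_duplicates(subset="name")

# Keep the first row for each distinct (name, breed) pair
unique_dogs = visits.drop_duplicates(subset=["name", "breed"])

print(unique_names)
print(unique_dogs)
```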

In [132]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals,indiv_per_10k,cum_individuals,cum_max_individuals
0,East South Central,Alabama,2570,864,4887681,3434,0.748398,5.258117,2570,2570
1,Pacific,Alaska,1434,582,735139,2016,0.71131,19.506515,4004,2570
2,Mountain,Arizona,7259,2606,7158024,9865,0.735834,10.141067,11263,7259
3,West South Central,Arkansas,2280,432,3009733,2712,0.840708,7.575423,13543,7259
4,Pacific,California,109008,20964,39461588,129972,0.838704,27.623825,122551,109008


In [135]:
# Drop duplicate regions, keeping the first row for each region
drop_dup_reg = homelessness.drop_duplicates('region')
drop_dup_reg

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals,indiv_per_10k,cum_individuals,cum_max_individuals
0,East South Central,Alabama,2570,864,4887681,3434,0.748398,5.258117,2570,2570
1,Pacific,Alaska,1434,582,735139,2016,0.71131,19.506515,4004,2570
2,Mountain,Arizona,7259,2606,7158024,9865,0.735834,10.141067,11263,7259
3,West South Central,Arkansas,2280,432,3009733,2712,0.840708,7.575423,13543,7259
6,New England,Connecticut,2280,1696,3571520,3976,0.573441,6.383837,132438,109008
7,South Atlantic,Delaware,708,374,965479,1082,0.654344,7.333148,133146,109008
13,East North Central,Illinois,6752,3891,12723071,10643,0.634408,5.306895,177482,109008
15,West North Central,Iowa,1711,1038,3148618,2749,0.622408,5.43413,182969,109008
30,Mid-Atlantic,New Jersey,6048,3350,8886025,9398,0.643541,6.806193,233533,109008


In [136]:
# Drop duplicate store/department combinations
#store_depts = sales.drop_duplicates(subset=['store', 'department'])
#print(store_depts.head())

# Subset the rows that are holiday weeks and drop duplicate dates
#holiday_dates = sales[sales['is_holiday']].drop_duplicates('date')

# Print date col of holiday_dates
#print(holiday_dates)

### Counting categorical variables

`Counting` is a great way `to get an overview of your data` and to spot curiosities that you might not notice otherwise.
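
As a small sketch on data from this notebook, the snippet below counts regions for seven states taken from the `homelessness` output (Alabama, Alaska, Arizona, Arkansas, California, Colorado, Nevada). `.value_counts()` sorts by count in descending order by default, and `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Regions of seven states copied from the homelessness output
regions = pd.Series(
    ["East South Central", "Pacific", "Mountain", "West South Central",
     "Pacific", "Mountain", "Mountain"],
    name="region",
)

counts = regions.value_counts()                # number of rows per region
props = regions.value_counts(normalize=True)   # share of rows per region

print(counts)
print(props)
```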

In [138]:
# Count the number of stores of each type
#store_counts = stores["type"].value_counts()
#print(store_counts)

# Get the proportion of stores of each type
#store_props = stores["type"].value_counts(normalize = True)
#print(store_props)

# Count the number of departments of each type and sort
#dept_counts_sorted = departments['department'].value_counts(sort=True)
#print(dept_counts_sorted)

# Get the proportion of departments of each type and sort
#dept_props_sorted = departments['department'].value_counts(sort=True, normalize= True)
#print(dept_props_sorted)

## 2.3 Grouped summary statistics

### What percent of sales occurred at each store type?

While `.groupby()` is useful, you can calculate grouped summary statistics without it.

Walmart distinguishes three types of stores: "supercenters", "discount stores", and "neighborhood markets", encoded in this dataset as type "A", "B", and "C". In this exercise, you'll calculate the total sales made at each store type, without using .groupby(). You can then use these numbers to see what proportion of Walmart's total sales were made at each.
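
The `sales` data is not loaded in this notebook, so here is a sketch of the same idea on five rows copied from the `homelessness` output: subset one group at a time, sum, and divide by the overall total. The frame name `sample` is illustrative.

```python
import pandas as pd

# Five rows copied from the homelessness output shown earlier
sample = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain", "Mountain"],
    "state": ["Alaska", "California", "Arizona", "Colorado", "Nevada"],
    "individuals": [1434, 109008, 7259, 7607, 7058],
})

total = sample["individuals"].sum()

# One subset-and-sum per group; this gets repetitive as the number of groups grows
pacific = sample[sample["region"] == "Pacific"]["individuals"].sum()
mountain = sample[sample["region"] == "Mountain"]["individuals"].sum()

print(pacific / total, mountain / total)
```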

In [143]:
# Calc total weekly sales
#sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
#sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
#sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
#sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
#sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
#print(sales_propn_by_type)

### Calculations with .groupby()

The `.groupby()` method **makes life much easier**. 

In this exercise, you'll perform the same calculations as last time, except you'll use the .groupby() method. 

You'll also perform calculations on data grouped by two variables to see whether sales differ by store type depending on whether it's a holiday week.
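
Since `sales` is not loaded here, the sketch below shows the `.groupby()` version on five rows copied from the `homelessness` output. The `pop_size` column is derived only for illustration (an assumed 5 million cut-off) so that there is a second variable to group by.

```python
import pandas as pd

# Five rows copied from the homelessness output shown earlier
sample = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain", "Mountain"],
    "state": ["Alaska", "California", "Arizona", "Colorado", "Nevada"],
    "individuals": [1434, 109008, 7259, 7607, 7058],
    "state_pop": [735139, 39461588, 7158024, 5691287, 3027341],
})

# Group by one variable: total individuals per region, then proportions
by_region = sample.groupby("region")["individuals"].sum()
props = by_region / sample["individuals"].sum()

# Illustrative second variable: label states above 5 million residents as "large"
sample["pop_size"] = sample["state_pop"].gt(5_000_000).map({True: "large", False: "small"})

# Group by two variables: the result has a two-level index
by_two = sample.groupby(["region", "pop_size"])["individuals"].sum()

print(props)
print(by_two)
```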

sales is available and pandas is loaded as pd.

In [144]:
# Group by type; calc total weekly sales

#sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
#sales_propn_by_type = sales_by_type/sales["weekly_sales"].sum()
#print(sales_propn_by_type)

### Multiple grouped summaries

Earlier in this chapter you saw that the `.agg()` method is useful to compute multiple statistics on multiple variables. 

It also works with grouped data. NumPy, which is imported as np, has many different summary statistics functions, including:

    np.min()
    np.max()
    np.mean()
    np.median()
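
As a hedged sketch of grouped aggregation with several statistics, the snippet below uses five rows copied from the `homelessness` output and passes the statistics by string name. This is one way to request the same summaries as the NumPy functions listed above.

```python
import pandas as pd

# Five rows copied from the homelessness output shown earlier
sample = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain", "Mountain"],
    "individuals": [1434, 109008, 7259, 7607, 7058],
})

# One row per region, one column per statistic
stats = sample.groupby("region")["individuals"].agg(["min", "max", "mean", "median"])
print(stats)
```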

In [145]:
# Import NumPy with the alias np
#import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median
#sales_stats = sales.groupby("type")["weekly_sales"].agg([np.min,np.max,np.mean,np.median])

# Print sales_stats
#print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
#unemp_fuel_stats = sales.groupby("type")[["unemployment","fuel_price_usd_per_l"]].agg([np.min,np.max,np.mean,np.median])

# Print unemp_fuel_stats
#print(unemp_fuel_stats)

### Pivot tables

### Pivoting on one variable

`Pivot tables` are **the standard way of aggregating data** in spreadsheets.

In pandas, pivot tables are essentially just **another way of performing grouped calculations**. That is, the `.pivot_table()` method is just an alternative to `.groupby()`.

In this exercise, you'll perform calculations using `.pivot_table()` to replicate the calculations you performed in the last lesson using `.groupby()`.
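
To make the equivalence concrete, the sketch below computes the mean number of individuals per region both ways on five rows copied from the `homelessness` output. `.pivot_table()` uses the mean as its default aggregation and returns a DataFrame, while the `.groupby()` version here returns a Series.

```python
import pandas as pd

# Five rows copied from the homelessness output shown earlier
sample = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain", "Mountain"],
    "individuals": [1434, 109008, 7259, 7607, 7058],
})

# Pivot table: values are aggregated (mean by default) for each index group
by_pivot = sample.pivot_table(values="individuals", index="region")

# The same calculation with groupby
by_group = sample.groupby("region")["individuals"].mean()

print(by_pivot)
print(by_group)
```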

sales is available and pandas is imported as pd.

In [146]:
# Pivot for mean weekly_sales for each store type
#mean_sales_by_type = sales.pivot_table(values="weekly_sales", index="type") 

# Print mean_sales_by_type
#print(mean_sales_by_type)

In [147]:
# Import NumPy as np
#import numpy as np

# Pivot for mean and median weekly_sales for each store type
#mean_med_sales_by_type = sales.pivot_table(values="weekly_sales", index="type", aggfunc=[np.mean, np.median] )

# Print mean_med_sales_by_type
#print(mean_med_sales_by_type)

In [149]:
# Pivot for mean weekly_sales by store type and holiday 
#mean_sales_by_type_holiday = sales.pivot_table(values="weekly_sales", index="type", columns="is_holiday")

# Print mean_sales_by_type_holiday
#print(mean_sales_by_type_holiday)

In [150]:
# Print mean weekly_sales by department and type; fill missing values with 0
#print(sales.pivot_table(values="weekly_sales" ,index="department", columns="type", fill_value=0))

In [151]:
# Print the mean weekly_sales by department and type; fill missing values with 0s; add an All row and column (margins)
#print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0, margins=True))

### Fill in missing values and sum values with pivot tables

The `.pivot_table()` method has several useful arguments, including `fill_value` and `margins`.

`fill_value` *replaces missing values* with a real value **(known as imputation)**. What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute **a dummy value**.

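`margins` is a shortcut for when you pivoted *by two variables*, but also wanted to pivot by each of those variables separately: it adds an `All` row and column that apply the aggregation function across each column and row (totals when the aggregation is a sum, overall means with the default `mean`).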
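
Here is a minimal sketch of both arguments on four rows copied from the `homelessness` output. The `pop_size` labels are assigned by hand for illustration (they are not in the dataset), and `aggfunc="sum"` is chosen so that the margins are totals. South Atlantic has no "large" row in this sample, so that cell would be missing without `fill_value`.

```python
import pandas as pd

# Four rows copied from the homelessness output; pop_size is an illustrative label
sample = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Pacific", "South Atlantic"],
    "state": ["Alaska", "California", "Hawaii", "District of Columbia"],
    "pop_size": ["small", "large", "small", "small"],
    "individuals": [1434, 109008, 4131, 3770],
})

table = sample.pivot_table(
    values="individuals",
    index="region",
    columns="pop_size",
    aggfunc="sum",     # sum instead of the default mean, so margins are totals
    fill_value=0,      # replace the missing (South Atlantic, large) cell with 0
    margins=True,      # add an "All" row and column
)
print(table)
```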

In this exercise, you'll practice using these arguments to up your pivot table skills, which will help you crunch numbers more efficiently!

sales is available and pandas is imported as pd.

# Chapter 3: Slicing and Indexing Data

## 3.1 Subsetting using slicing

## 3.2 Indexes and subsetting using indexes

# Chapter 4: Creating and Visualizing Data

## 4.1 Plotting

## 4.2 Handling missing data

## 4.3 Reading data into a DataFrame