
# Introduction to Pandas


**<code>[pandas](https://pandas.pydata.org/)</code>** is a library with high-level data structures and manipulation tools:

**DataFrame Object**
    
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure that holds ***relational data***.

https://pandas.pydata.org/docs/reference/frame.html

Data is aligned in a tabular fashion with labeled rows and columns like a spreadsheet. A Pandas DataFrame consists of three components: the data, rows, and columns.
***
Key takeaways: 
* Represents a tabular, spreadsheet-like data structure
* Ordered collection of columns
* Each column can be a different value type (numeric, string, boolean, etc.)
* Holds relational data

<img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg" />

<div style="padding: 5px; padding-left: 10px;">
    
<center><h2>Table of Contents</h2></center>
<h4><a href='#introduction'>Introduction</a></h4>
<li>Intro Relational Data
<li>Why use pandas
<li>Resources & documentation

<h4><a href='#imports'>Imports</a></h4>
<h4><a href='#dataload'>Data Loading</a></h4>

<h4><a href='#explore'>Explore the dataset & basic functionality</a></h4>
<li>View the dataframe, how many rows & columns
<li>View specifics about columns
<li>Get data types and unique entries
<li>Descriptive statistics

<h4><a href='#dataselection'>Data Selection, slicing & dicing</a></h4>
<li>Select data by labels/names
<li>Select data by position
<li>“views” vs copies of the data
<li>Conditional selection
<li>Multiple condition selection


## Relational Data

Pandas dataframes/tables typically contain relational data. Relational data is data that captures associations or relationships between data points. This is often expressed as a table with columns indicating the quantities related. For example, "First names" are associated with "Last Names".
    

This is a table with two columns, one column for First Name and one column for Last Name.

<img src="../support_files/images/pandas/pandas_relational_df.png" width='250'>  
    


For the pedantic, a "relation" is a table with no duplicate entries.  We might, for example, have two John Smiths.  We can try to eliminate this collision problem by introducing an **Index**.

<img src="../support_files/images/pandas/pandas_student_df_with_index.png" width='300'>  
    
**A brief note on row vs index**
In this notebook and lecture we will use the terms row number and index. In many contexts these terms are interchangeable; however, because pandas distinguishes between different ways of slicing and dicing data, we will give them distinct meanings:

* row number: the position of the row, counting from 0
* index: the label given to a specific row
 
    
<img src="../support_files/images/pandas/pandas_relation_df_generic.png" width='250'>       
   
In the above table we have added an "Index".  The `DataFrame` object in Python is a representation of a table with these components.  It is composed of rows, each with an index, and labeled columns. 
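
As a minimal sketch of these components, the cell below builds a small table by hand. The names and student IDs are made up for illustration and are not part of this tutorial's dataset; `.iloc` (position) and `.loc` (label) are covered in detail later on.

```python
import pandas as pd

# Hypothetical student table: two different students share the same name,
# so the index (a made-up student ID) is what tells the rows apart
students = pd.DataFrame(
    {
        "First Name": ["John", "Maria", "John"],
        "Last Name": ["Smith", "Garcia", "Smith"],
    },
    index=pd.Index(["S001", "S002", "S003"], name="Student ID"),
)

print(students.columns.tolist())  # column labels
print(students.index.tolist())    # index labels, one per row

# Row number 2 (counting from 0) and index label "S003" point at the same row
by_position = students.iloc[2]
by_label = students.loc["S003"]
print(by_position.equals(by_label))
```
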
    
<br>
In a general `DataFrame`, we might have many different relations captured in the same table.  For example, a `DataFrame` with student data from a school might look something like this:
    
<img src="../support_files/images/pandas/pandas_relational_student_df.png" width='500'>  

<br>

**Data Representation**

When thinking about data analysis, note that the above table already gives us something to think about regarding how choices of data representation affect conclusions.  
What if someone only has one name?  
What if they have a name that isn't easily represented as "First name/Last name"?  
What if they come from a culture that keeps track of multiple names?  

When interacting with tables and `DataFrames`, it is important to keep these issues in mind.  Structural choices about data can and will affect conclusions.  Whenever you make a `DataFrame` or use one someone else has constructed for you, you are making or dealing with these kinds of choices.
    
There are standard operations related to questions you might have about the students in the above `DataFrame`. 
<ul>
    <li> Which students took Physics?  </li>
    <li> Which students got an A in any course?  Which Students got an A or B in either Physics or History? </li>
    </ul>

The `DataFrame` object has operations that allow these kinds of questions to be answered in a computationally efficient manner. These are covered in this tutorial.
    
</div>


## Many different relations

If we only needed to subselect from one table, we wouldn't really need something like `pandas` and its `DataFrame` (though it is helpful for this!).  The real power of the `DataFrame` object appears when we have multiple relations, i.e. multiple `DataFrame`s, and we wish to combine them in some way.

For example we might have `DataFrame`s that represent student Grades, or Professors, or Schools.

<img src="../support_files/images/pandas/pandas_grade_df.png" width='200'>
    
or a `DataFrame` with Courses offered by Departments in different schools:

<img src="../support_files/images/pandas/pandas_departments_df.png" width='500'>
    
or a `DataFrame` of Professors and the Courses they teach
    
<img src="../support_files/images/pandas/pandas_professors.png" width='500'>
    
   

The main purpose of `pandas` and the `DataFrame` object is to allow us to ask questions across multiple sets of relations.  

    
<ul>
    <li>What is the average score of students of course Y from professor X (who may have taught at different institutions….)?</li>
    <li>What is the average number of students at school X from State Y?</li>
    <li>What is the distribution of grades from students in Biology whose home town is Y?</li>
</ul>
    



`DataFrames` have powerful tools like `merge` to combine information from multiple `DataFrame`s that allow you to ask these kinds of questions quickly and easily.
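
A rough sketch of what such a cross-table question can look like is shown here. The two tables, their column names, and the professor names are all invented for illustration; only `merge` and `groupby` are actual pandas methods.

```python
import pandas as pd

# Hypothetical tables, invented for illustration
grades = pd.DataFrame(
    {
        "Student": ["Ana", "Ben", "Cy", "Dee"],
        "Course": ["Physics", "Physics", "History", "History"],
        "Score": [90, 70, 85, 95],
    }
)
professors = pd.DataFrame(
    {
        "Course": ["Physics", "History"],
        "Professor": ["Prof. X", "Prof. Z"],
    }
)

# Match each grade row to the professor teaching that course
combined = grades.merge(professors, on="Course", how="left")

# Average score of the students taught by each professor
average_by_professor = combined.groupby("Professor")["Score"].mean()
print(average_by_professor)
```
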
    
Let's get started!  


<a id='why_use_pandas'></a>

## Why Use Pandas
    
**Annotated Data is Powerful!**

Because pandas combines data labels with values, it facilitates easy:
* Dataset exploration
* Dataset visualization
* Basic statistical analysis    

<br/>

**Data manipulation is easy!**

Pandas takes the functionality of slicing & dicing data and layers in many other helpful features that help with:
* Cleaning data
    * Column headers are descriptive not numerical.
    * Columns hold single variables.
    * Variables are stored in either rows or columns, not both.
    * Handling missing or invalid values
* Data wrangling
    * Getting the data into a structure that facilitates your analysis
* Easily loading and saving data
    * Load many formats of data and save in many formats


<br/>
    
**High Level Data manipulation**

Pandas supports vectorized mathematical operations, which improve computational performance and execution speed. Additionally, the 2D labeled tabular structure of a pandas dataframe allows other high-level manipulations such as:
* Data grouping and aggregation
* Table manipulations (transforming rows to columns and/or columns to rows)
* Merging or joining multiple dataframes/tables
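
To make the first two bullets and the idea of vectorized operations concrete, here is a small sketch. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame(
    {
        "Region": ["East", "East", "West", "West"],
        "Quarter": ["Q1", "Q2", "Q1", "Q2"],
        "Units": [10, 20, 30, 40],
        "Unit Price": [2.0, 2.0, 3.0, 3.0],
    }
)

# Vectorized arithmetic: whole columns are multiplied at once, no Python loop
sales["Revenue"] = sales["Units"] * sales["Unit Price"]

# Grouping and aggregation: total revenue per region
revenue_by_region = sales.groupby("Region")["Revenue"].sum()

# Table manipulation: quarters become columns, regions become rows
revenue_wide = sales.pivot(index="Region", columns="Quarter", values="Revenue")

print(revenue_by_region)
print(revenue_wide)
```
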


### Documentation & Resources
    
This introduction only scratches the surface of Pandas functionality. For more information, check out the [full documentation](https://pandas.pydata.org/docs/reference/index.html).
    
This [cheat-sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is also highly recommended to print out and keep handy as a resource.
<p>Or check out the <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html">'10 minutes to Pandas'</a> tutorial here (note: title may mischaracterize time investment).


<a id='imports'></a>

## Import Packages


**Library imports**    
Here we'll load the libraries we'll use to shape and explore the data


In [209]:
import os
import numpy as np

In [210]:
# import `pandas` and give it a short name (or alias) `pd` since we will type it very frequently
import pandas as pd

<a id='dataload'></a>

## Load dataset

Our first step is loading in the data from a file

Pandas has great [tools for automatically interpreting data from many sources](https://pandas.pydata.org/docs/reference/io.html):
    
* [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
* [pd.read_feather](https://pandas.pydata.org/docs/reference/api/pandas.read_feather.html)    
* pd.read_excel
* pd.read_pickle
* pd.read_json
    
We will use `pd.read_feather`.
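
The other readers follow the same pattern. As a self-contained sketch, the cell below reads CSV text from an in-memory buffer instead of a file; the CSV content is hypothetical and only stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Row ID,Country,Quantity
A-1,Canada,3
A-2,Chile,5
"""

# pd.read_csv accepts a path or a file-like object; index_col picks the index column
orders = pd.read_csv(io.StringIO(csv_text), index_col="Row ID")
print(orders)

# Saving works in reverse: to_csv() with no path returns the CSV text
round_trip = pd.read_csv(io.StringIO(orders.to_csv()), index_col="Row ID")
print(round_trip.equals(orders))
```
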


In [211]:
filepath = os.path.join('support_files', 'datasets', 'messy_superstore_data.feather')
df = pd.read_feather(filepath)

# ignore this line of code for now- it will be explained a bit later
df.set_index('Row ID', inplace = True)

<a id='explore'></a>

## Explore the dataset

**table of contents**

* <a href='#df'>View dataframe "preview"</a>
* <a href='#.head()'>View rows from beginning or end of dataframe (df.head, df.tail)</a>
* <a href='#shape'>Get dataframe shape and length(len, np.shape)</a>   
* <a href='#.columns'>List all columns(.columns)</a>
* <a href='#dtypes'>Get column data types(.dtypes)</a>
* <a href='#describe'>Get descriptive statistics</a>



<a id='df'></a>

#### View the dataframe

Simply calling the dataframe (`df`, or whatever you've named it) gives a preview of the dataframe/table. For a large table like this one, the default display settings:
* Show the first 5 and last 5 rows of data
* Show the first 10 and last 10 columns of data



In [212]:
df

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
IN-2014-23218,IN-2014-75456,Consumer,FURNITURE,,FURNISHINGS,"Rubbermaid Door Stop, Erganomic",FUR-FU-10004064,Afghanistan,APAC,Central Asia,...,,,,,,,,Kabul,Kabul,"[Dunder Mifflin, Globex Corp, Hudsucker Indust..."
IN-2014-24599,IN-2014-29767,Home Office,FURNITURE,,BOOKCASES,"Ikea Library with Doors, Mobile",FUR-BO-10001255,Afghanistan,APAC,Central Asia,...,,,,731.820,,,,Herat,Hirat,"[ACME Co, Buy n Large, Dunder Mifflin, Globex ..."
IN-2014-24597,IN-2014-29767,Home Office,FURNITURE,,FURNISHINGS,"Rubbermaid Door Stop, Erganomic",FUR-FU-10004064,Afghanistan,APAC,Central Asia,...,,,,169.680,,,,Herat,Hirat,"[Dunder Mifflin, LexCorp, Olivander Crafts, Ro..."
IN-2014-27993,IN-2014-20415,Home Office,FURNITURE,,BOOKCASES,"Bush Classic Bookcase, Pine",FUR-BO-10002204,Afghanistan,APAC,Central Asia,...,,2070.15,,,,,,Kabul,Kabul,"[Dunder Mifflin, Olivander Crafts]"
IN-2014-28967,IN-2014-47337,Corporate,FURNITURE,,CHAIRS,"Hon Rocking Chair, Red",FUR-CH-10003965,Afghanistan,APAC,Central Asia,...,914.34,,,,,,,Kabul,Kabul,"[ACME Co, Buy n Large, Dunder Mifflin, LexCorp..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZA-2014-49187,ZA-2014-9750,Corporate,TECHNOLOGY,,ACCESSORIES,"Memorex Router, USB",TEC-MEM-10002202,Zambia,Africa,Africa,...,,,,,,246.42,,Ndola,Copperbelt,"[ACME Co, Royco Waystar, Umbrella Corporation]"
ZI-2014-42069,ZI-2014-7610,Corporate,TECHNOLOGY,,MACHINES,"StarTech Phone, Red",TEC-STA-10000699,Zimbabwe,Africa,Africa,...,,,,21.501,,,,Bulawayo,Bulawayo,[Dunder Mifflin]
ZI-2014-43712,ZI-2014-5970,Home Office,TECHNOLOGY,,ACCESSORIES,"Belkin Router, USB",TEC-BEL-10003985,Zimbabwe,Africa,Africa,...,,,,,,,77.688,Bulawayo,Bulawayo,"[ACME Co, Buy n Large, Hudsucker Industries]"
ZI-2014-48372,ZI-2014-9550,Consumer,TECHNOLOGY,,MACHINES,"Konica Receipt Printer, Red",TEC-KON-10003116,Zimbabwe,Africa,Africa,...,71.64,,,,,,,Bulawayo,Bulawayo,"[ACME Co, Wayne Enterprises]"


<a id='.head()'></a>

#### View the first or last n rows

**[.head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)**: shows the first n rows

**[.tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html)**: shows the last n rows

* <code>df.head()</code> and <code>df.tail()</code> show 5 rows of data by default
* adding a number <code>df.head(n)</code> adjusts the number of rows shown


In [213]:
# show the first 8 rows
df.head(8)

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
IN-2014-23218,IN-2014-75456,Consumer,FURNITURE,,FURNISHINGS,"Rubbermaid Door Stop, Erganomic",FUR-FU-10004064,Afghanistan,APAC,Central Asia,...,,,,,,,,Kabul,Kabul,"[Dunder Mifflin, Globex Corp, Hudsucker Indust..."
IN-2014-24599,IN-2014-29767,Home Office,FURNITURE,,BOOKCASES,"Ikea Library with Doors, Mobile",FUR-BO-10001255,Afghanistan,APAC,Central Asia,...,,,,731.82,,,,Herat,Hirat,"[ACME Co, Buy n Large, Dunder Mifflin, Globex ..."
IN-2014-24597,IN-2014-29767,Home Office,FURNITURE,,FURNISHINGS,"Rubbermaid Door Stop, Erganomic",FUR-FU-10004064,Afghanistan,APAC,Central Asia,...,,,,169.68,,,,Herat,Hirat,"[Dunder Mifflin, LexCorp, Olivander Crafts, Ro..."
IN-2014-27993,IN-2014-20415,Home Office,FURNITURE,,BOOKCASES,"Bush Classic Bookcase, Pine",FUR-BO-10002204,Afghanistan,APAC,Central Asia,...,,2070.15,,,,,,Kabul,Kabul,"[Dunder Mifflin, Olivander Crafts]"
IN-2014-28967,IN-2014-47337,Corporate,FURNITURE,,CHAIRS,"Hon Rocking Chair, Red",FUR-CH-10003965,Afghanistan,APAC,Central Asia,...,914.34,,,,,,,Kabul,Kabul,"[ACME Co, Buy n Large, Dunder Mifflin, LexCorp..."
AG-2014-50986,AG-2014-2760,Consumer,FURNITURE,,FURNISHINGS,"Deflect-O Light Bulb, Erganomic",FUR-DEF-10002865,Algeria,Africa,Africa,...,,,,,,,17.61,Saida,Saida,"[ACME Co, Hudsucker Industries, Wayne Enterpri..."
AG-2014-50983,AG-2014-2760,Consumer,FURNITURE,,CHAIRS,"Novimex Rocking Chair, Black",FUR-NOV-10002453,Algeria,Africa,Africa,...,,,,,,,516.0,Saida,Saida,"[Dunder Mifflin, Umbrella Corporation]"
AG-2014-49384,AG-2014-2040,Consumer,FURNITURE,,FURNISHINGS,"Rubbermaid Frame, Durable",FUR-RUB-10003004,Algeria,Africa,Africa,...,,,,,,,,Algiers,Alger,"[Buy n Large, Hudsucker Industries, LexCorp, R..."


In [214]:
# show the last 7 rows
df.tail(7)

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
ZA-2014-50068,ZA-2014-6660,Home Office,TECHNOLOGY,,COPIERS,"Brother Fax and Copier, High-Speed",TEC-BRO-10003401,Zambia,Africa,Africa,...,,,,,,189.69,,Lusaka,Lusaka,"[Buy n Large, LexCorp, Olivander Crafts, Umbre..."
ZA-2014-50069,ZA-2014-6660,Home Office,TECHNOLOGY,,COPIERS,"Hewlett Fax Machine, High-Speed",TEC-HEW-10002304,Zambia,Africa,Africa,...,,,,,,318.12,,Lusaka,Lusaka,"[Buy n Large, Globex Corp, Hudsucker Industrie..."
ZA-2014-49187,ZA-2014-9750,Corporate,TECHNOLOGY,,ACCESSORIES,"Memorex Router, USB",TEC-MEM-10002202,Zambia,Africa,Africa,...,,,,,,246.42,,Ndola,Copperbelt,"[ACME Co, Royco Waystar, Umbrella Corporation]"
ZI-2014-42069,ZI-2014-7610,Corporate,TECHNOLOGY,,MACHINES,"StarTech Phone, Red",TEC-STA-10000699,Zimbabwe,Africa,Africa,...,,,,21.501,,,,Bulawayo,Bulawayo,[Dunder Mifflin]
ZI-2014-43712,ZI-2014-5970,Home Office,TECHNOLOGY,,ACCESSORIES,"Belkin Router, USB",TEC-BEL-10003985,Zimbabwe,Africa,Africa,...,,,,,,,77.688,Bulawayo,Bulawayo,"[ACME Co, Buy n Large, Hudsucker Industries]"
ZI-2014-48372,ZI-2014-9550,Consumer,TECHNOLOGY,,MACHINES,"Konica Receipt Printer, Red",TEC-KON-10003116,Zimbabwe,Africa,Africa,...,71.64,,,,,,,Bulawayo,Bulawayo,"[ACME Co, Wayne Enterprises]"
ZI-2014-48014,ZI-2014-3570,Consumer,TECHNOLOGY,,MACHINES,"Okidata Calculator, Red",TEC-OKI-10001433,Zimbabwe,Africa,Africa,...,,,,,,,,Harare,Harare,"[Dunder Mifflin, Olivander Crafts, Royco Wayst..."


<a id='shape'></a>

#### Get the shape and length of the dataframe

Pandas is built off of numpy, so many familiar functions/methods work with DataFrames    

* **[numpy shape function](https://numpy.org/doc/stable/reference/generated/numpy.shape.html)** can give the dimensions of a dataframe
    * <code>np.shape(df)</code> or the equivalent attribute <code>df.shape</code> - returns (rows, columns)
* **[len()](https://docs.python.org/3/library/functions.html#len)** - the built-in Python function will return the number of rows in a dataframe
    * <code>len(df)</code></div>


In [215]:
df.shape

(17531, 34)

In [216]:
len(df)

17531

<a id='.columns'></a>

#### List Columns

**[.columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html)** provides a list of the column labels for a dataframe
    
<code>df.columns</code>


In [217]:
# let's try it out
df.columns

Index(['Order ID', 'Segment', 'Category', 'Category (OLD)', 'Sub-Category',
       'Product Name', 'Product ID', 'Country', 'Market', 'Region', 'Quantity',
       'Discount', 'Profit', 'Customer ID', 'Customer Name', 'Order Priority',
       'Postal Code', 'Ship Mode', 'Shipping Cost', '10/1/2014', '7/1/2014',
       '11/1/2014', '9/1/2014', '1/1/2014', '12/1/2014', '8/1/2014',
       '5/1/2014', '3/1/2014', '4/1/2014', '2/1/2014', '6/1/2014', 'City',
       'State', 'manufacturers'],
      dtype='object')

<a id='dtypes'></a>

#### Show data from a single column

Like retrieving a value from a dictionary, we can get the data for a single column by indexing with the column's name:

In [218]:
df['Country']

Row ID
IN-2014-23218    Afghanistan
IN-2014-24599    Afghanistan
IN-2014-24597    Afghanistan
IN-2014-27993    Afghanistan
IN-2014-28967    Afghanistan
                    ...     
ZA-2014-49187         Zambia
ZI-2014-42069       Zimbabwe
ZI-2014-43712       Zimbabwe
ZI-2014-48372       Zimbabwe
ZI-2014-48014       Zimbabwe
Name: Country, Length: 17531, dtype: object

#### Get column or element data types

* **[.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html)** : lists the data type for all columns in a dataframe 
    * <code>df.dtypes</code>
    
* **type()** allows inspection of the data type of a specific element (i.e. a specific row & column)
    * This can be helpful if <code>.dtypes</code> reports "object"
    * <code>type(df['column'][index label])</code>
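
The sketch below illustrates why this matters: an "object" column can hold different Python types in different rows. The two-row table is hypothetical, loosely modeled on the `manufacturers` column of this dataset.

```python
import pandas as pd

# Hypothetical column mixing a plain string and a list
mixed = pd.DataFrame(
    {"manufacturers": ["ACME Co", ["ACME Co", "Globex Corp"]]},
    index=["row-a", "row-b"],
)

print(mixed.dtypes)  # the column as a whole is reported as object

# type() on individual elements reveals what is actually stored
first_type = type(mixed.loc["row-a", "manufacturers"])
second_type = type(mixed.loc["row-b", "manufacturers"])
print(first_type, second_type)
```
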


In [219]:
df.dtypes

Order ID           object
Segment            object
Category           object
Category (OLD)    float64
Sub-Category       object
Product Name       object
Product ID         object
Country            object
Market             object
Region             object
Quantity          float64
Discount          float64
Profit            float64
Customer ID        object
Customer Name      object
Order Priority     object
Postal Code       float64
Ship Mode          object
Shipping Cost     float64
10/1/2014         float64
7/1/2014          float64
11/1/2014         float64
9/1/2014          float64
1/1/2014          float64
12/1/2014         float64
8/1/2014          float64
5/1/2014          float64
3/1/2014          float64
4/1/2014          float64
2/1/2014          float64
6/1/2014          float64
City               object
State              object
manufacturers      object
dtype: object

In [220]:
# use the built-in type() function to inspect one element of the "Category" column
# (here, the row whose index label is 'ZA-2014-50068')
type(df["Category"]['ZA-2014-50068'])

str

<a id='unique'></a>

### Get descriptive statistics


Descriptive statistics are summary statistics that quantitatively describe or summarize features of a collection of information or a dataset. They typically include sample size, measures of central tendency (mean, median, mode), and measures of variability or dispersion (standard deviation, min, max, kurtosis, skewness).
    
 
**[.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)** will return descriptive statistics for all numeric columns of a dataframe.

<img src="../support_files/images/pandas/pandas_describe.png">  

<code>df.describe()</code> 

For each numerical column the following descriptive statistics are provided:
* count
* mean
* standard deviation
* minimum
* 25, 50 & 75th percentiles
* max


In [221]:
df.describe()

,Category (OLD),Quantity,Discount,Profit,Postal Code,Shipping Cost,10/1/2014,7/1/2014,11/1/2014,9/1/2014,1/1/2014,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014
count,0.0,17531.0,17531.0,17531.0,3321.0,17531.0,1626.0,1087.0,2147.0,2018.0,918.0,2153.0,1675.0,1284.0,1068.0,1051.0,756.0,1748.0
mean,,3.457989,0.143291,28.75854,56192.587775,26.268085,260.004077,237.999706,258.630194,238.432727,262.819777,233.694238,272.608921,224.611407,246.349038,230.991305,244.493857,229.870745
std,,2.290856,0.21183,174.283412,31977.397359,56.54526,543.523703,433.300176,520.039,463.761215,528.201441,432.184769,502.542821,396.292046,593.698325,442.449941,399.02282,423.420032
min,,1.0,0.0,-3839.9904,1841.0,0.01,0.99,1.08,1.197,1.359,2.04,1.161,1.584,1.188,0.556,1.188,1.788,0.444
25%,,2.0,0.0,0.0,28110.0,2.57,31.491,27.225,30.84,26.793,27.4125,30.36,36.984,30.105,32.202,28.545,35.46,29.4675
50%,,3.0,0.0,9.2,60440.0,7.72,87.94,78.327,90.936,79.62,83.9124,79.12,99.87,79.425,83.7888,79.14,93.765,79.96
75%,,5.0,0.2,36.8085,90032.0,24.455,267.678,244.1475,269.19,244.085,263.4915,232.88,284.043,226.73,246.533625,234.338,275.121,239.477925
max,,14.0,0.8,6719.9808,99301.0,867.69,11199.968,4001.04,10499.97,7958.58,5443.96,4864.32,5211.12,4298.85,13999.96,4799.984,3425.4,5486.67


#### Get unique entries for a column

**<code>[.unique()](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)</code>**


<code>df['column_name'].unique()</code>   returns an array of all unique entries</div>


In [222]:
# get all unique entries for the Country column

df['Country'].unique()

array(['Afghanistan', 'Algeria', 'Angola', 'Argentina', 'Australia',
       'Austria', 'Azerbaijan', 'Bangladesh', 'Barbados', 'Belarus',
       'Belgium', 'Bolivia', 'Brazil', 'Bulgaria', 'Cambodia', 'Cameroon',
       'Canada', 'Chile', 'China', 'Colombia', "Cote d'Ivoire", 'Croatia',
       'Cuba', 'Czech Republic', 'Democratic Republic of the Congo',
       'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Estonia', 'Finland', 'France', 'Gabon', 'Georgia', 'Germany',
       'Ghana', 'Guatemala', 'Haiti', 'Honduras', 'Hungary', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kyrgyzstan',
       'Lebanon', 'Liberia', 'Libya', 'Lithuania', 'Macedonia',
       'Madagascar', 'Malaysia', 'Mali', 'Martinique', 'Mexico',
       'Moldova', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique',
       'Myanmar (Burma)', 'Nepal', 'Netherlands', 'New Zealand',
       'Nicaragua', 'Nige

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 7.1** What are the unique values for the Market column?

**Exercise 7.2** Identify the data type for the Product ID column
</div>

In [223]:
# Answer 7.1:
df['Market'].unique()

array(['APAC', 'Africa', 'LATAM', 'EU', 'EMEA', 'Canada', 'US'],
      dtype=object)

In [224]:
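# Answer 7.2:
# note: type() on a whole column returns the container type (a Series);
# the column's data type is listed by df.dtypes above (object for "Product ID")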
type(df["Product ID"])

pandas.core.series.Series

## Series

A `Series` is a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

https://pandas.pydata.org/docs/reference/api/pandas.Series.html

Series can be created from a dictionary, an ndarray, or a scalar, or from random values generated with `np.random` (for example `np.random.randn`).

https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html
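
The non-random construction routes can be sketched as follows. The values and labels are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray: pass the labels explicitly with index=
from_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every index label
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_dict["b"], from_array["z"], from_scalar.tolist())
```
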


In [225]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [226]:
s

a    0.934887
b    1.330699
c    0.902554
d   -0.174106
e   -0.316665
dtype: float64

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">



**Exercise 7.3** What is the average of the series?

**Exercise 7.4** What does `describe` do for a series?
</div>

In [227]:
# Answer 7.3:
s.mean()

0.5354738256817713

In [228]:
# Answer 7.4
s.describe()

count    5.000000
mean     0.535474
std      0.734218
min     -0.316665
25%     -0.174106
50%      0.902554
75%      0.934887
max      1.330699
dtype: float64

### Other built-in summary statistic functions

Pandas also provides a large set of summary functions that can operate on different kinds of pandas objects (dataframe columns, Series, Groupby etc.)

* **<code>[.count()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html?highlight=count#pandas.DataFrame.count)</code>**    
* **<code>[.sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html?highlight=sum#pandas.DataFrame.sum)</code>**
* **<code>[.min()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html?highlight=min#pandas.DataFrame.min)</code>**
* **<code>[.max()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html?highlight=max#pandas.DataFrame.max)</code>**
* **<code>[.mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html?highlight=mean#pandas.DataFrame.mean)</code>**
* **<code>[.median()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html?highlight=median#pandas.DataFrame.median)</code>**
* **<code>[.var()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html?highlight=var#pandas.DataFrame.var)</code>**
* **<code>[.std()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html?highlight=std#pandas.DataFrame.std)</code>**
    
These functions become especially useful in the next section, where there might be a specific selection of data you want to analyze.
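
One detail worth knowing: by default these functions skip missing values (NaN), which is why the counts in the `describe()` output above differ from column to column. A minimal sketch with a hypothetical four-value column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
costs = pd.Series([10.0, 20.0, np.nan, 30.0], name="Shipping Cost")

n_rows = len(costs)        # number of rows, missing values included
n_values = costs.count()   # number of non-missing values
total = costs.sum()        # missing values are skipped by default
average = costs.mean()     # mean of the non-missing values only

print(n_rows, n_values, total, average)
```
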

In [229]:
# here we get the standard deviation for the entire discount column
df["Discount"].std()

0.21183023108516105

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 7.5** What is the maximum value in the Shipping Cost column?
</div>

In [230]:
# Answer 7.5
df['Shipping Cost'].max()

867.69

<a id='dataselection'></a>

## Data Selection, Slicing & Dicing
   
**basic data selection & filtering**
* <a href='#column_selection'>Column selection</a>
* <a href='#view_vs_copy'>'view' of data vs a copy</a>    
* <a href='#loc'>Select data by labels/names</a>
* <a href='#row_number_vs_index'>A note on row number vs index</a>
* <a href='#iloc'>Select data by integer Index/position</a>
   
**advanced data selection**
* <a href='#conditionalselection'>Select data for a given condition or threshold</a>
* <a href='#conditionalselectionandreturn'>Return subset of data for a given condition or threshold</a>
* <a href='#multicondition'>Select data based on multiple conditions or thresholds</a>
* <a href='#multioptionlist'>Multiple condition options using a list</a>


<a id='column_selection'></a>

### Column Selection

**[Column Selection docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-specific-columns-from-a-dataframe)**
    
<img src="../support_files/images/pandas/pandas_select_columns.png">  

To view a single column
* <code>df['column_name']</code>    (returns a pandas Series object, which is similar to a 1D array or list)

~ OR ~
* <code>df[['column_name']]</code>     (returns a DataFrame)

Multiple columns:
* <code>df[['column_1', 'column_2']]</code></div>
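
The difference between single and double brackets is easy to check on a small table. The table below is hypothetical; only the bracket behavior is the point.

```python
import pandas as pd

# Hypothetical table, invented for illustration
orders = pd.DataFrame(
    {
        "Category": ["FURNITURE", "TECHNOLOGY"],
        "Customer ID": ["AA-1", "BB-2"],
        "Quantity": [3, 5],
    }
)

single = orders["Category"]                      # single brackets -> Series
as_frame = orders[["Category"]]                  # list with one name -> one-column DataFrame
two_columns = orders[["Category", "Quantity"]]   # list of names -> DataFrame

print(type(single).__name__, type(as_frame).__name__, two_columns.shape)
```
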


In [231]:
# get the Category column as a pd.Series

df["Category"]

Row ID
IN-2014-23218     FURNITURE
IN-2014-24599     FURNITURE
IN-2014-24597     FURNITURE
IN-2014-27993     FURNITURE
IN-2014-28967     FURNITURE
                    ...    
ZA-2014-49187    TECHNOLOGY
ZI-2014-42069    TECHNOLOGY
ZI-2014-43712    TECHNOLOGY
ZI-2014-48372    TECHNOLOGY
ZI-2014-48014    TECHNOLOGY
Name: Category, Length: 17531, dtype: object

In [232]:
# get the Category and Customer ID columns 

df[["Category", "Customer ID"]]

Row ID,Category,Customer ID
IN-2014-23218,FURNITURE,AA-10375
IN-2014-24599,FURNITURE,CA-12055
IN-2014-24597,FURNITURE,CA-12055
IN-2014-27993,FURNITURE,GM-14455
IN-2014-28967,FURNITURE,VB-21745
...,...,...
ZA-2014-49187,TECHNOLOGY,TS-11205
ZI-2014-42069,TECHNOLOGY,BS-1380
ZI-2014-43712,TECHNOLOGY,JB-6045
ZI-2014-48372,TECHNOLOGY,JC-5775


<a id='view_vs_copy'></a>

#### Views vs copies

We often want to work with only a subset of a dataframe. For that purpose, we can select only those rows or columns that we need and leave the rest.
    
As we learned in the numpy tutorial on views vs copies, subsetting an array does not always produce a new array; sometimes what numpy returns is a view of the data in the original array.
Since pandas Series and DataFrames are backed by numpy arrays, it will probably come as no surprise that something similar sometimes happens in pandas. Unfortunately, while this behavior is relatively straightforward in numpy, in pandas there’s just no getting around the fact that it’s a hot mess.
    
**The View/Copy Headache in pandas**: In numpy, the rules for when you get views and when you don’t are a little complicated, but they are consistent: certain behaviors (like basic slicing) will always return a view, and others (fancy indexing) will never return a view.
    
But in pandas, whether you get a view or not, and whether changes made to a view will propagate back to the original DataFrame, depends on the structure and data types in the original DataFrame.
    
<img src="../support_files/images/pandas/pandas_view_vs_copy_b.png">  

    
**When to create a copy using <code>[.copy()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html)</code>**
      
**Creating copies** of dataframes can be especially useful when doing exploratory analysis on a dataframe where you want to keep the integrity of the original dataframe in case you make a mistake.
    
<code>new_df = df.copy()</code> makes a copy (by creating a new object) of this object’s indices and data. By default, modifications to the data or indices of the copy will not be reflected in the original object; see the documentation and the parameter <code>deep</code> for more information.
    
Note: Like all other variables, try to keep your dataframe naming descriptive and intuitive to read! (For example, "new_df" would be a bad name.)
<img src="../support_files/images/pandas/pandas_copy_example_a.png" width='60%'>
<img src="../support_files/images/pandas/pandas_copy_example_b.png" width='60%'>
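
The same idea as a runnable sketch, using a small hypothetical table: changes made to an explicit copy leave the original untouched.

```python
import pandas as pd

# Hypothetical table, invented for illustration
original = pd.DataFrame({"Discount": [0.0, 0.2, 0.5]}, index=["r1", "r2", "r3"])

# An explicit copy owns its own data
working_copy = original.copy()
working_copy.loc["r1", "Discount"] = 0.9

print(original.loc["r1", "Discount"])      # value in the original
print(working_copy.loc["r1", "Discount"])  # value in the copy
```
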


<a id='loc'></a>

### Select rows or columns using labels

**<code>[.loc()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)</code>** allows access to a group of rows and columns by label(s) or a boolean array.

**Select rows:** 
* <code>df.loc['index']</code>
    * As with the column selection described above, <code>df.loc[ ]</code> with a single label returns a Series, whereas <code>df.loc[[ ]]</code> with a list of labels returns a DataFrame

**Select row and column:**
* <code>df.loc['index', 'column']</code>
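
The difference between single and double brackets is easy to check on a small example. The sketch below uses a made-up DataFrame (`toy` and its row labels are hypothetical) rather than the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame(
    {"Country": ["Afghanistan", "Algeria", "Zimbabwe"],
     "Profit": [356.58, 61.92, 10.0]},
    index=["row-a", "row-b", "row-c"],
)

row_as_series = toy.loc["row-a"]     # single label -> Series
row_as_frame = toy.loc[["row-a"]]    # list of labels -> DataFrame
cell = toy.loc["row-c", "Country"]   # row label + column label -> scalar

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(cell)                          # Zimbabwe
```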


In [233]:
# get rows where the index is IN-2014-28967

df.loc["IN-2014-28967"]

Order ID                                              IN-2014-47337
Segment                                                   Corporate
Category                                                  FURNITURE
Category (OLD)                                                  NaN
Sub-Category                                                 CHAIRS
Product Name                                 Hon Rocking Chair, Red
Product ID                                          FUR-CH-10003965
Country                                                 Afghanistan
Market                                                         APAC
Region                                                 Central Asia
Quantity                                                        7.0
Discount                                                        0.0
Profit                                                       356.58
Customer ID                                                VB-21745
Customer Name                                   

In [234]:
# get the value at index ZI-2014-48372 and column 'Country'
df.loc["ZI-2014-48372",'Country']

'Zimbabwe'

<a id='iloc'></a>

### Select rows and columns by position
    
**[.iloc[ ]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)** allows positional selection, i.e. by row number and column number.

(Note that the same slicing notation that we saw with numpy arrays works with `iloc`.)
    
Example: <code>df.iloc[row number(s), column number(s)]</code>

Examples: 
* Select all rows and all columns
    * <code>df.iloc[:,:]</code> 
* Select first 5 rows and all columns
    * <code>df.iloc[0:5, :]</code> 
* Select all rows and last 5 columns
    * <code>df.iloc[:,-5:]</code>
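
Because `iloc` slices follow Python's half-open convention, the stop position is excluded. The sketch below checks this on a made-up 8 x 6 DataFrame (`toy` and its column names are hypothetical).

```python
import pandas as pd

# Hypothetical 8-row, 6-column frame filled with distinct integers.
toy = pd.DataFrame(
    [[r * 10 + c for c in range(6)] for r in range(8)],
    columns=list("abcdef"),
)

first_four = toy.iloc[0:4, :]     # positions 0, 1, 2, 3 (stop is excluded)
first_five = toy.iloc[:5, :]      # positions 0 through 4
last_two_cols = toy.iloc[:, -2:]  # all rows, last two columns

print(first_four.shape)             # (4, 6)
print(first_five.shape)             # (5, 6)
print(list(last_two_cols.columns))  # ['e', 'f']
```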


In [235]:
# get the first 9 rows and all columns
df.iloc[:9,:]

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
IN-2014-23218,IN-2014-75456,Consumer,FURNITURE,,FURNISHINGS,"Rubbermaid Door Stop, Erganomic",FUR-FU-10004064,Afghanistan,APAC,Central Asia,...,,,,,,,,Kabul,Kabul,"[Dunder Mifflin, Globex Corp, Hudsucker Indust..."
IN-2014-24599,IN-2014-29767,Home Office,FURNITURE,,BOOKCASES,"Ikea Library with Doors, Mobile",FUR-BO-10001255,Afghanistan,APAC,Central Asia,...,,,,731.82,,,,Herat,Hirat,"[ACME Co, Buy n Large, Dunder Mifflin, Globex ..."
IN-2014-24597,IN-2014-29767,Home Office,FURNITURE,,FURNISHINGS,"Rubbermaid Door Stop, Erganomic",FUR-FU-10004064,Afghanistan,APAC,Central Asia,...,,,,169.68,,,,Herat,Hirat,"[Dunder Mifflin, LexCorp, Olivander Crafts, Ro..."
IN-2014-27993,IN-2014-20415,Home Office,FURNITURE,,BOOKCASES,"Bush Classic Bookcase, Pine",FUR-BO-10002204,Afghanistan,APAC,Central Asia,...,,2070.15,,,,,,Kabul,Kabul,"[Dunder Mifflin, Olivander Crafts]"
IN-2014-28967,IN-2014-47337,Corporate,FURNITURE,,CHAIRS,"Hon Rocking Chair, Red",FUR-CH-10003965,Afghanistan,APAC,Central Asia,...,914.34,,,,,,,Kabul,Kabul,"[ACME Co, Buy n Large, Dunder Mifflin, LexCorp..."
AG-2014-50986,AG-2014-2760,Consumer,FURNITURE,,FURNISHINGS,"Deflect-O Light Bulb, Erganomic",FUR-DEF-10002865,Algeria,Africa,Africa,...,,,,,,,17.61,Saida,Saida,"[ACME Co, Hudsucker Industries, Wayne Enterpri..."
AG-2014-50983,AG-2014-2760,Consumer,FURNITURE,,CHAIRS,"Novimex Rocking Chair, Black",FUR-NOV-10002453,Algeria,Africa,Africa,...,,,,,,,516.0,Saida,Saida,"[Dunder Mifflin, Umbrella Corporation]"
AG-2014-49384,AG-2014-2040,Consumer,FURNITURE,,FURNISHINGS,"Rubbermaid Frame, Durable",FUR-RUB-10003004,Algeria,Africa,Africa,...,,,,,,,,Algiers,Alger,"[Buy n Large, Hudsucker Industries, LexCorp, R..."
AG-2014-47438,AG-2014-2600,Home Office,FURNITURE,,BOOKCASES,"Ikea 3-Shelf Cabinet, Pine",FUR-IKE-10003642,Algeria,Africa,Africa,...,,,,,,,,Algiers,Alger,"[Buy n Large, Dunder Mifflin, Hudsucker Indust..."


In [236]:
# get all rows and last 4 columns 
df.iloc[:,-4:]

Row ID,6/1/2014,City,State,manufacturers
IN-2014-23218,,Kabul,Kabul,"[Dunder Mifflin, Globex Corp, Hudsucker Indust..."
IN-2014-24599,,Herat,Hirat,"[ACME Co, Buy n Large, Dunder Mifflin, Globex ..."
IN-2014-24597,,Herat,Hirat,"[Dunder Mifflin, LexCorp, Olivander Crafts, Ro..."
IN-2014-27993,,Kabul,Kabul,"[Dunder Mifflin, Olivander Crafts]"
IN-2014-28967,,Kabul,Kabul,"[ACME Co, Buy n Large, Dunder Mifflin, LexCorp..."
...,...,...,...,...
ZA-2014-49187,,Ndola,Copperbelt,"[ACME Co, Royco Waystar, Umbrella Corporation]"
ZI-2014-42069,,Bulawayo,Bulawayo,[Dunder Mifflin]
ZI-2014-43712,77.688,Bulawayo,Bulawayo,"[ACME Co, Buy n Large, Hudsucker Industries]"
ZI-2014-48372,,Bulawayo,Bulawayo,"[ACME Co, Wayne Enterprises]"


<a id='conditionalselection'></a>

#### Select data given a specific threshold or condition

You can conditionally select data using **[.loc[ ]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)** or **[.query()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)** or directly on the dataframe. Each of these uses boolean masking. 
     
<code>.loc</code>
* <code>df.loc[df['column'] > 7]</code>
* <code>df.loc[df['column'] == 'string']</code>
    * Pros: explicit, can be easier to read for those who already know python
    * Cons: more verbose than query
    
----
<code>.query()</code>:
This method takes a boolean expression written as a string and may be easier and more intuitive for those who know SQL or other database languages. 

Examples:
* <code>df.query('column > 7')</code>
* <code>df.query('column == "string"')</code>
    * Pros:
        * can be easier for those who know SQL or other database languages
        * less verbose
    * Cons:
        * can be difficult for those who don't know SQL
        * column names that contain spaces must be wrapped in backticks inside the expression
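
To compare the two styles side by side, here is a minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical). It also shows the two `.query()` details that tend to trip people up: string values need their own quotes, and column names containing spaces are wrapped in backticks.

```python
import pandas as pd

# Hypothetical toy data; "Product Name" deliberately contains a space.
toy = pd.DataFrame(
    {"Profit": [600.0, -20.0, 510.5],
     "Market": ["EU", "EU", "APAC"],
     "Product Name": ["Chair", "Desk", "Lamp"]}
)

# The same filter written with .loc and with .query().
by_loc = toy.loc[toy["Profit"] > 500]
by_query = toy.query("Profit > 500")
print(by_loc.equals(by_query))  # True

# String values must be quoted inside the query expression.
eu = toy.query('Market == "EU"')

# Column names containing spaces are wrapped in backticks.
desks = toy.query('`Product Name` == "Desk"')
print(len(eu), len(desks))  # 2 1
```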


In [237]:
# use .loc to select where Profit > 500
df.loc[df['Profit'] > 500]

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
IN-2014-27993,IN-2014-20415,Home Office,FURNITURE,,BOOKCASES,"Bush Classic Bookcase, Pine",FUR-BO-10002204,Afghanistan,APAC,Central Asia,...,,2070.15,,,,,,Kabul,Kabul,"[Dunder Mifflin, Olivander Crafts]"
AO-2014-44396,AO-2014-2270,Consumer,FURNITURE,,TABLES,"Chromcraft Round Table, Adjustable Height",FUR-CHR-10002278,Angola,Africa,Africa,...,,,1877.16,,,,,Luanda,Luanda,"[Dunder Mifflin, Globex Corp, LexCorp, Royco W..."
IN-2014-24624,IN-2014-23880,Consumer,FURNITURE,,BOOKCASES,"Sauder Classic Bookcase, Traditional",FUR-BO-10004852,Australia,APAC,Oceania,...,,,,,,,3139.128,Mandurah,Western Australia,"[Buy n Large, Globex Corp, LexCorp, Umbrella C..."
IN-2014-31153,IN-2014-80286,Consumer,FURNITURE,,CHAIRS,"Harbour Creations Executive Leather Armchair, ...",FUR-CH-10000051,Australia,APAC,Oceania,...,,,,,,,,Wollongong,New South Wales,"[Globex Corp, LexCorp, Royco Waystar, Umbrella..."
ID-2014-21496,ID-2014-35892,Consumer,FURNITURE,,CHAIRS,"SAFCO Executive Leather Armchair, Set of Two",FUR-CH-10003597,Australia,APAC,Oceania,...,,,,,,,,Brisbane,Queensland,"[ACME Co, Buy n Large, Dunder Mifflin, Globex ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CA-2014-33921,CA-2014-127180,Home Office,TECHNOLOGY,,PHONES,Polycom CX600 IP Phone VoIP phone,TEC-PH-10001494,United States,US,East,...,,,,,,,,New York City,New York,"[Buy n Large, Dunder Mifflin, Hudsucker Indust..."
CA-2014-33920,CA-2014-127180,Home Office,TECHNOLOGY,,COPIERS,Canon imageCLASS 2200 Advanced Copier,TEC-CO-10004722,United States,US,East,...,,,,,,,,New York City,New York,[Globex Corp]
CA-2014-36178,CA-2014-143567,Corporate,TECHNOLOGY,,ACCESSORIES,Logitech diNovo Edge Keyboard,TEC-AC-10004145,United States,US,South,...,,,,,,,,Henderson,Kentucky,"[Dunder Mifflin, Globex Corp, Hudsucker Indust..."
CA-2014-37637,CA-2014-143112,Corporate,TECHNOLOGY,,MACHINES,"3D Systems Cube Printer, 2nd Generation, Magenta",TEC-MA-10001047,United States,US,East,...,,,,,,,,New York City,New York,"[Buy n Large, Hudsucker Industries, LexCorp, O..."


<a id='conditionalselectionandreturn'></a>

#### Return a subset of data given a specific threshold or condition

You can also return specific columns after a conditional selection using **[<code>.loc</code>](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)**. 
  
Single return column:
* <code>df.loc[df['selection_column'] > 7, 'return_column']</code>

Multiple return columns (provide columns as a list):
* <code>df.loc[df['selection_column'] > 7, ['return_column1', 'return_column2']]</code>
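
The two patterns above return different types: a single column label gives a Series, while a list of labels gives a DataFrame. A short sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"Category": ["FURNITURE", "TECHNOLOGY", "FURNITURE"],
     "Sub-Category": ["CHAIRS", "PHONES", "TABLES"],
     "Profit": [356.58, 900.0, 1877.16]}
)

# Single return column -> Series
one_col = toy.loc[toy["Profit"] > 500, "Category"]
# List of return columns -> DataFrame
two_cols = toy.loc[toy["Profit"] > 500, ["Category", "Sub-Category"]]

print(one_col.tolist())  # ['TECHNOLOGY', 'FURNITURE']
print(two_cols.shape)    # (2, 2)
```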


In [238]:
# get the Category and Sub-Category where profit is above 500

df.loc[df["Profit"] > 500, ["Category", "Sub-Category"]]

Row ID,Category,Sub-Category
IN-2014-27993,FURNITURE,BOOKCASES
AO-2014-44396,FURNITURE,TABLES
IN-2014-24624,FURNITURE,BOOKCASES
IN-2014-31153,FURNITURE,CHAIRS
ID-2014-21496,FURNITURE,CHAIRS
...,...,...
CA-2014-33921,TECHNOLOGY,PHONES
CA-2014-33920,TECHNOLOGY,COPIERS
CA-2014-36178,TECHNOLOGY,ACCESSORIES
CA-2014-37637,TECHNOLOGY,MACHINES


<a id='multicondition'></a>

#### Multiple condition selection

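You may wish to select rows or columns where multiple conditions are met. As in numpy, you can combine conditions inside <code>.loc</code> with <code>&</code> (and) and <code>|</code> (or). Wrap each condition in parentheses, because <code>&</code> and <code>|</code> bind more tightly than comparison operators such as <code>==</code> and <code>></code>.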

example:
<code>df.loc[(df['column1']=='string') & (df['column2'] > threshold)]</code>
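
Here is a minimal sketch of both operators on a made-up DataFrame (`toy` and its values are hypothetical). As an aside that goes slightly beyond the text above, `Series.isin` is a common shorthand when several `|` conditions test the same column against different values.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"Country": ["Australia", "Zambia", "Cuba", "Australia"],
     "Sub-Category": ["BOOKCASES", "COPIERS", "BOOKCASES", "CHAIRS"],
     "Profit": [3139.128, 189.69, 950.0, 88.101]}
)

# & : both conditions must hold (note the parentheses around each one).
both = toy.loc[(toy["Sub-Category"] == "BOOKCASES") & (toy["Profit"] > 900)]

# | : at least one condition must hold.
either = toy.loc[(toy["Country"] == "Australia") | (toy["Country"] == "Zambia")]

# Equivalent shorthand for several "or" tests on the same column.
same_as_either = toy.loc[toy["Country"].isin(["Australia", "Zambia"])]

print(len(both))                      # 2
print(len(either))                    # 3
print(either.equals(same_as_either))  # True
```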


In [239]:
# return all entries where Profit is greater than 900 and the Sub-Category is "BOOKCASES"
# the & means both conditions must be met

df.loc[(df['Sub-Category'] == "BOOKCASES") & (df['Profit'] > 900)]

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
IN-2014-24624,IN-2014-23880,Consumer,FURNITURE,,BOOKCASES,"Sauder Classic Bookcase, Traditional",FUR-BO-10004852,Australia,APAC,Oceania,...,,,,,,,3139.128,Mandurah,Western Australia,"[Buy n Large, Globex Corp, LexCorp, Umbrella C..."
IN-2014-21263,IN-2014-56206,Consumer,FURNITURE,,BOOKCASES,"Sauder Classic Bookcase, Metal",FUR-BO-10001471,Australia,APAC,Oceania,...,,,,,,,5486.67,Sydney,New South Wales,"[ACME Co, Buy n Large, Hudsucker Industries, U..."
IN-2014-24037,IN-2014-17930,Consumer,FURNITURE,,BOOKCASES,"Sauder Classic Bookcase, Metal",FUR-BO-10001471,China,APAC,North Asia,...,,,,,,,,Qingdao,Shandong,"[ACME Co, Globex Corp, Hudsucker Industries, O..."
IN-2014-28056,IN-2014-66573,Consumer,FURNITURE,,BOOKCASES,"Bush Classic Bookcase, Mobile",FUR-BO-10004665,China,APAC,North Asia,...,,,,,,,,Beijing,Beijing,"[ACME Co, Buy n Large, Olivander Crafts, Umbre..."
MX-2014-4452,MX-2014-157077,Consumer,FURNITURE,,BOOKCASES,"Dania Classic Bookcase, Traditional",FUR-BO-10002300,Cuba,LATAM,Caribbean,...,,,,,,,,Camagüey,Camagüey,"[ACME Co, Buy n Large, Globex Corp, Royco Ways..."
IN-2014-25795,IN-2014-76016,Corporate,FURNITURE,,BOOKCASES,"Sauder Classic Bookcase, Traditional",FUR-BO-10004852,India,APAC,Central Asia,...,,,,,,,,Thiruvananthapuram,Kerala,"[Buy n Large, Hudsucker Industries, Umbrella C..."
IT-2014-16653,IT-2014-4540740,Consumer,FURNITURE,,BOOKCASES,"Safco Classic Bookcase, Metal",FUR-BO-10004999,Spain,EU,South,...,2188.05,,,,,,,Seville,Andalusía,"[Globex Corp, Hudsucker Industries, Olivander ..."
ES-2014-12449,ES-2014-4957212,Consumer,FURNITURE,,BOOKCASES,"Bush Classic Bookcase, Pine",FUR-BO-10004709,United Kingdom,EU,North,...,,,,,,,,Burnley,England,"[Buy n Large, Globex Corp, LexCorp, Olivander ..."
ZA-2014-42448,ZA-2014-7540,Consumer,FURNITURE,,BOOKCASES,"Ikea Library with Doors, Traditional",FUR-IKE-10002894,Zambia,Africa,Africa,...,,,,,,,,Lusaka,Lusaka,"[ACME Co, Royco Waystar, Umbrella Corporation,..."


In [240]:
# return all entries where the Country is "Australia" or the Country is "Zambia"
# the | means at least one condition must be met

df.loc[(df['Country']=="Australia") | (df['Country']=="Zambia")]

Row ID,Order ID,Segment,Category,Category (OLD),Sub-Category,Product Name,Product ID,Country,Market,Region,...,12/1/2014,8/1/2014,5/1/2014,3/1/2014,4/1/2014,2/1/2014,6/1/2014,City,State,manufacturers
ID-2014-25801,ID-2014-42388,Consumer,FURNITURE,,FURNISHINGS,"Deflect-O Frame, Duo Pack",FUR-FU-10004780,Australia,APAC,Oceania,...,,99.09,,,,,,Wollongong,New South Wales,[Hudsucker Industries]
IN-2014-21802,IN-2014-25644,Consumer,FURNITURE,,BOOKCASES,"Bush Corner Shelving, Metal",FUR-BO-10004230,Australia,APAC,Oceania,...,444.42,,,,,,,Tamworth,New South Wales,"[Hudsucker Industries, LexCorp]"
IN-2014-22521,IN-2014-10090,Consumer,FURNITURE,,CHAIRS,"Harbour Creations Steel Folding Chair, Red",FUR-CH-10001204,Australia,APAC,Oceania,...,,,,,88.101,,,Sydney,New South Wales,"[Buy n Large, Dunder Mifflin, Globex Corp, Oli..."
IN-2014-26860,IN-2014-64396,Home Office,FURNITURE,,FURNISHINGS,"Deflect-O Stacking Tray, Black",FUR-FU-10001129,Australia,APAC,Oceania,...,,,,,,,91.26,Bundaberg,Queensland,"[ACME Co, Buy n Large, Dunder Mifflin, LexCorp..."
ID-2014-30742,ID-2014-84164,Home Office,FURNITURE,,FURNISHINGS,"Tenex Photo Frame, Duo Pack",FUR-FU-10000507,Australia,APAC,Oceania,...,,,,31.14,,,,Cairns,Queensland,"[ACME Co, Dunder Mifflin, Olivander Crafts]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZA-2014-41769,ZA-2014-1370,Consumer,TECHNOLOGY,,COPIERS,"Brother Personal Copier, Color",TEC-BRO-10003986,Zambia,Africa,Africa,...,,,,,,,,Chingola,Copperbelt,"[ACME Co, Hudsucker Industries, Olivander Craf..."
ZA-2014-45397,ZA-2014-8330,Corporate,TECHNOLOGY,,PHONES,"Cisco Headset, VoIP",TEC-CIS-10003439,Zambia,Africa,Africa,...,,88.53,,,,,,Lusaka,Lusaka,"[Buy n Large, Olivander Crafts, Umbrella Corpo..."
ZA-2014-50068,ZA-2014-6660,Home Office,TECHNOLOGY,,COPIERS,"Brother Fax and Copier, High-Speed",TEC-BRO-10003401,Zambia,Africa,Africa,...,,,,,,189.69,,Lusaka,Lusaka,"[Buy n Large, LexCorp, Olivander Crafts, Umbre..."
ZA-2014-50069,ZA-2014-6660,Home Office,TECHNOLOGY,,COPIERS,"Hewlett Fax Machine, High-Speed",TEC-HEW-10002304,Zambia,Africa,Africa,...,,,,,,318.12,,Lusaka,Lusaka,"[Buy n Large, Globex Corp, Hudsucker Industrie..."


<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">


**EXERCISE 6.6** Select Row ID "AG-2014-50983" and all columns

**EXERCISE 6.7** Get the mean profit when the Category is furniture
    
**EXERCISE 6.8** How many entries in the dataframe are there where "Profit" is negative and "Market" is 'EU'?</div>


In [241]:
# Answer 6.6:
df.loc["AG-2014-50983"]

Order ID                                    AG-2014-2760
Segment                                         Consumer
Category                                       FURNITURE
Category (OLD)                                       NaN
Sub-Category                                      CHAIRS
Product Name                Novimex Rocking Chair, Black
Product ID                              FUR-NOV-10002453
Country                                          Algeria
Market                                            Africa
Region                                            Africa
Quantity                                             4.0
Discount                                             0.0
Profit                                             61.92
Customer ID                                      CL-2565
Customer Name                                Clay Ludtke
Order Priority                                      High
Postal Code                                      41244.0
Ship Mode                      

In [242]:
# Answer 6.7 with loc:
df.loc[df['Category']=='FURNITURE', 'Profit'].mean()


26.68421359426352

In [243]:
# Answer 6.8
# Count the rows where "Profit" is negative and "Market" is 'EU'

# using .loc:
len(df.loc[(df['Profit'] < 0) & (df['Market'] == 'EU')])

# using .query:
len(df.query('Profit < 0 & Market == "EU"'))

768