# Filtering, Selecting, Sorting 

As data scientists, we regularly need to slice and dice data to extract meaningful insights.
In this section, we will look at common manipulation tasks: filtering, selecting, and sorting.

In this section, we will use a US store sales dataset. It contains several columns, including the row ID; the order ID, which uniquely identifies an order; the order date, when a particular product was ordered; the ship date, when the product was actually shipped; the mode of shipping; the customer ID; the customer name; the segment the customer belongs to; the country (all the observations in this data are from the United States); the city and the state the order belongs to; sales (the total billing amount in dollars); the total quantity purchased; the discount offered, if any; and the total profit the retailer made from the sale.

In [1]:
# For data manipulation, we start with the standard imports: the pandas
# and NumPy libraries. NumPy provides the array and mathematical operations
# that pandas Series and DataFrames are built on.

import pandas as pd
import numpy as np
from pandas import DataFrame
import os

# Modify the location to the folder where the files have been copied
os.chdir('C:\\Python Code\\Data Manipulation with Pandas\\Filtering, Selecting, Sorting and Adding New Columns')

In [2]:
data=pd.read_csv('Store.csv',sep=',',header=0, encoding="latin")

In [3]:
# In a real-world analytics project, a dataset is often too large to open in Excel.
# In such cases, the pandas head method lets us see what the data contains
# by showing the first few observations. By default, it displays the first 5 rows.

data.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2013-152156,11/9/2013,11/12/2013,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2013-152156,11/9/2013,11/12/2013,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2013-138688,6/13/2013,6/17/2013,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2012-108966,10/11/2012,10/18/2012,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2012-108966,10/11/2012,10/18/2012,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


In [4]:
# The shape attribute on the pandas dataframe gives the 
# total number of observations and the total number of 
# columns in the data

data.shape

(9994, 21)

In [5]:
# Display the list of all column names.

print(data.columns.tolist())

['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit']


In [6]:
# Which unique cities are the orders being delivered to?

print(data['City'].unique().tolist())

['Henderson', 'Los Angeles', 'Fort Lauderdale', 'Concord', 'Seattle', 'Fort Worth', 'Madison', 'West Jordan', 'San Francisco', 'Fremont', 'Philadelphia', 'Orem', 'Houston', 'Richardson', 'Naperville', 'Melbourne', 'Eagan', 'Westland', 'Dover', 'New Albany', 'New York City', 'Troy', 'Chicago', 'Gilbert', 'Springfield', 'Jackson', 'Memphis', 'Decatur', 'Durham', 'Columbia', 'Rochester', 'Minneapolis', 'Portland', 'Saint Paul', 'Aurora', 'Charlotte', 'Orland Park', 'Urbandale', 'Columbus', 'Bristol', 'Wilmington', 'Bloomington', 'Phoenix', 'Roseville', 'Independence', 'Pasadena', 'Newark', 'Franklin', 'Scottsdale', 'San Jose', 'Edmond', 'Carlsbad', 'San Antonio', 'Monroe', 'Fairfield', 'Grand Prairie', 'Redlands', 'Hamilton', 'Westfield', 'Akron', 'Denver', 'Dallas', 'Whittier', 'Saginaw', 'Medina', 'Dublin', 'Detroit', 'Tampa', 'Santa Clara', 'Lakeville', 'San Diego', 'Brentwood', 'Chapel Hill', 'Morristown', 'Cincinnati', 'Inglewood', 'Tamarac', 'Colorado Springs', 'Belleville', 'Taylor

In [7]:
# To count the unique cities, apply the built-in len function to the result
len(data['City'].unique())


531
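
The same count can also be obtained with the `nunique` method of a pandas Series. The sketch below is a minimal illustration on a small made-up frame: `toy` and its values are hypothetical stand-ins, not the store data. Note that `nunique` skips missing values by default, whereas `unique` keeps them, so the two counts agree only when the column has no missing values.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'City' column of the store data.
toy = pd.DataFrame({"City": ["Henderson", "Los Angeles", "Henderson", "Seattle"]})

# Length of the array of distinct values.
count_via_len = len(toy["City"].unique())

# Direct count of distinct values (missing values are skipped by default).
count_via_nunique = toy["City"].nunique()

print(count_via_len, count_via_nunique)
```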

### Find the total quantity sold in the 'East' region

In [8]:
# Step 1 - First, filter the data for the East region.
# We can match the value 'East' (the comparison is case-sensitive) using the == operator.

data[data['Region']=="East"].head(10)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
23,24,US-2014-156909,7/17/2014,7/19/2014,Second Class,SF-20065,Sandra Flanagan,Consumer,United States,Philadelphia,...,19140,East,FUR-CH-10002774,Furniture,Chairs,"Global Deluxe Stacking Chair, Gray",71.372,2,0.3,-1.0196
27,28,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,FUR-BO-10004834,Furniture,Bookcases,"Riverside Palais Royal Lawyers Bookcase, Royal...",3083.43,7,0.5,-1665.0522
28,29,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,OFF-BI-10000474,Office Supplies,Binders,Avery Recycled Flexi-View Covers for Binding S...,9.618,2,0.7,-7.0532
29,30,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,FUR-FU-10004848,Furniture,Furnishings,"Howard Miller 13-3/4"" Diameter Brushed Chrome ...",124.2,3,0.2,15.525
30,31,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,OFF-EN-10001509,Office Supplies,Envelopes,Poly String Tie Envelopes,3.264,2,0.2,1.1016
31,32,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,OFF-AR-10004042,Office Supplies,Art,"BOSTON Model 1800 Electric Pencil Sharpeners, ...",86.304,6,0.2,9.7092
32,33,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,OFF-BI-10001525,Office Supplies,Binders,"Acco Pressboard Covers with Storage Hooks, 14 ...",6.858,6,0.7,-5.715
33,34,US-2012-150630,9/17/2012,9/21/2012,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,...,19140,East,OFF-AR-10001683,Office Supplies,Art,Lumber Crayons,15.76,2,0.2,3.546
47,48,CA-2013-169194,6/21/2013,6/26/2013,Standard Class,LH-16900,Lena Hernandez,Consumer,United States,Dover,...,19901,East,TEC-AC-10002167,Technology,Accessories,Imation 8gb Micro Traveldrive Usb 2.0 Flash Drive,45.0,3,0.0,4.95
48,49,CA-2013-169194,6/21/2013,6/26/2013,Standard Class,LH-16900,Lena Hernandez,Consumer,United States,Dover,...,19901,East,TEC-PH-10003988,Technology,Phones,"LF Elite 3D Dazzle Designer Hard Case Cover, L...",21.8,2,0.0,6.104


In [9]:
# To find the number of rows, take element 0 of the shape attribute

data[data['Region']=="East"].shape[0]

2848

In [10]:
# A slightly more elegant way to do the above operations is to use the query method

data.query("Region=='East'").shape[0]

2848
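
As a rough illustration of how the two filtering styles relate, the sketch below applies both to a small made-up frame (`toy` is hypothetical; only the column names mirror the store data). It also shows two details that are easy to miss: string comparison is case-sensitive, and a query string can refer to a local Python variable with the `@` prefix.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the store data.
toy = pd.DataFrame({
    "Region": ["East", "West", "East", "South"],
    "Quantity": [2, 5, 7, 1],
})

# Boolean-mask filtering.
by_mask = toy[toy["Region"] == "East"]

# The same filter written as a query string.
by_query = toy.query("Region == 'East'")

# '@' lets the query string refer to a local Python variable.
region = "East"
by_variable = toy.query("Region == @region")

# String comparison is case-sensitive, so 'east' matches no rows here.
lowercase_matches = toy[toy["Region"] == "east"].shape[0]

print(by_mask.equals(by_query), by_mask.equals(by_variable), lowercase_matches)
```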

In [11]:
# Step 2 - To get a specific column after applying the filter, provide the column name within
# square brackets.

data.query("Region=='East'")['Quantity'].head(10)

23    2
27    7
28    2
29    3
30    2
31    6
32    6
33    2
47    3
48    2
Name: Quantity, dtype: int64

In [12]:
# Step 3 - To find the total quantity in the East region,
# apply the sum method to the 'Quantity' column of the filtered data.

data.query("Region=='East'")['Quantity'].sum()

10618
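
The three steps above can also be condensed into a single `.loc` call, and a `groupby` gives the totals for every region at once as a cross-check. This is a sketch on made-up data (`toy` and its numbers are hypothetical, not taken from the store file), so the totals it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the store data.
toy = pd.DataFrame({
    "Region": ["East", "West", "East", "South"],
    "Quantity": [2, 5, 7, 1],
})

# Filter the rows and select the column in one .loc call, then sum.
east_total = toy.loc[toy["Region"] == "East", "Quantity"].sum()

# Totals for every region at once; the 'East' entry should match east_total.
totals_by_region = toy.groupby("Region")["Quantity"].sum()

print(east_total)
print(totals_by_region)
```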

## Sorting data

Sort operations can be performed on data frames to sort the data in ascending or descending order of one or more columns. 

The "sort_values" method can be applied to a data frame. By default, sorting is performed in ascending order. This can be reversed by passing the parameter "ascending" as False.

A data frame can also be filtered first; in that case, only the filtered rows are sorted. The dot (.) operator works much like a pipe: it applies the specified method to the output of the previous step.
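
The following sketch illustrates these sorting options on a small made-up frame (`toy` and its values are hypothetical): the default ascending sort, a descending sort, a sort on more than one column with a separate direction for each, and a filter chained with a sort.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the store data.
toy = pd.DataFrame({
    "Region": ["South", "East", "South", "East"],
    "Sales": [100.0, 250.0, 400.0, 50.0],
})

# Default: ascending order of Sales.
ascending_sales = toy.sort_values("Sales")

# Descending order of Sales.
descending_sales = toy.sort_values("Sales", ascending=False)

# Several columns: Region A to Z, then Sales from high to low within each region.
by_region_then_sales = toy.sort_values(["Region", "Sales"], ascending=[True, False])

# Chaining: filter first, then sort only the filtered rows.
south_sorted = toy.query("Region == 'South'").sort_values("Sales", ascending=False)

print(by_region_then_sales)
print(south_sorted)
```

Passing a list to "ascending" is only needed when the columns should be sorted in different directions; a single True or False applies to all of them.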

### Display the most valuable customers in South Region by Sales

In [13]:
# Use the .query method to filter the rows in the dataframe
# based on a specific criterion. Apply the sort_values method to
# the result. Display the first five rows using the head method.

data.query("Region=='South'").sort_values('Sales',ascending=False).head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
2697,2698,CA-2011-145317,3/18/2011,3/23/2011,Standard Class,SM-20320,Sean Miller,Home Office,United States,Jacksonville,...,32216,South,TEC-MA-10002412,Technology,Machines,Cisco TelePresence System EX90 Videoconferenci...,22638.48,6,0.5,-1811.0784
8488,8489,CA-2013-158841,2/2/2013,2/4/2013,Second Class,SE-20110,Sanjit Engle,Consumer,United States,Arlington,...,22204,South,TEC-MA-10001127,Technology,Machines,HP Designjet T520 Inkjet Large Format Printer ...,8749.95,5,0.0,2799.984
683,684,US-2014-168116,11/5/2014,11/5/2014,Same Day,GT-14635,Grant Thornton,Corporate,United States,Burlington,...,27217,South,TEC-MA-10004125,Technology,Machines,Cubify CubeX 3D Printer Triple Head Print,7999.98,4,0.5,-3839.9904
509,510,CA-2012-145352,3/16/2012,3/22/2012,Standard Class,CM-12385,Christopher Martinez,Consumer,United States,Atlanta,...,30318,South,OFF-BI-10003527,Office Supplies,Binders,Fellowes PB500 Electric Punch Plastic Comb Bin...,6354.95,5,0.0,3177.475
4297,4298,CA-2014-129021,8/24/2014,8/27/2014,Second Class,PO-18850,Patrick O'Brill,Consumer,United States,Tallahassee,...,32303,South,TEC-PH-10001459,Technology,Phones,Samsung Galaxy Mega 6.3,4367.896,13,0.2,327.5922


### Select the top 10 customer IDs from this sorted data

In [14]:
# We are assuming that the Customer IDs are unique and do not repeat.
# If the Customer IDs repeat across rows, we will have to
# use a 'groupby' clause.

data.query("Region=='South'").sort_values('Sales',ascending=False)['Customer ID'].head(10)

2697    SM-20320
8488    SE-20110
683     GT-14635
509     CM-12385
4297    PO-18850
9639    JH-15985
3280    GM-14695
7583    KH-16690
4093    KW-16435
1454    MC-17425
Name: Customer ID, dtype: object
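
If Customer IDs do repeat across rows, one possible approach is to total the sales per customer before ranking. The sketch below assumes that "most valuable" means the highest summed Sales; the frame `toy`, its customer IDs, and its amounts are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data in which customer 'AA-1' appears on two rows.
toy = pd.DataFrame({
    "Region": ["South", "South", "South", "East"],
    "Customer ID": ["AA-1", "BB-2", "AA-1", "CC-3"],
    "Sales": [100.0, 150.0, 80.0, 500.0],
})

top_south_customers = (
    toy.query("Region == 'South'")      # keep only the South rows
       .groupby("Customer ID")["Sales"] # one group per customer
       .sum()                           # total sales per customer
       .sort_values(ascending=False)    # highest total first
       .head(10)                        # at most ten customers
)

print(top_south_customers)
```

Here 'AA-1' ranks first because its two orders are added together, which a row-level sort would not show.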

### In the East Region who are the most valuable customers?

In [15]:
# Display the top 10 customers from the 'East' region
# with the highest Sales values.

data.query("Region=='East'").sort_values('Sales',ascending=False)['Customer ID'].head(10)

2623    TA-21385
4190    HL-15040
4277    BS-11365
6425    CC-12370
6626    TB-21400
7666    DR-12940
6340    TS-21370
1085    KD-16270
1803    JA-15970
8204    KD-16495
Name: Customer ID, dtype: object
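
As an alternative to sorting and then taking the head, pandas also offers `nlargest`, which returns the rows with the largest values in a column. The sketch below compares the two on a small made-up frame (`toy` and its values are hypothetical); with no tied Sales values, both return the same rows in the same order.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the store data.
toy = pd.DataFrame({
    "Customer ID": ["AA-1", "BB-2", "CC-3", "DD-4"],
    "Sales": [100.0, 400.0, 250.0, 50.0],
})

# Sort in descending order, then keep the first two rows.
via_sort = toy.sort_values("Sales", ascending=False)["Customer ID"].head(2)

# Ask directly for the two rows with the largest Sales.
via_nlargest = toy.nlargest(2, "Sales")["Customer ID"]

print(via_sort.tolist(), via_nlargest.tolist())
```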