---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.6</h1>

## _Subsetting Dataframes.ipynb_

## Motivation:
- Selecting specific rows and columns, and filtering data based on specific conditions, are two of the key features of Pandas.
    - **Selection** allows you to access specific rows or columns (a subset) of the data by their index and/or location in the DataFrame
        - In large datasets, you may be required to select the first/last N records
        - In large datasets, you may be required to select a range (n to m) of records
        - In large datasets, you may be required to select specific columns of your interest
        - In large datasets, you may be required to select specific range and specific columns of your interest
    - **Filtering** allows you to access specific rows or columns (a subset) of the data based on one or more conditions
        - In a medical dataset, you may be required to filter the records of all patients who suffer from a specific disease, or who have a specific blood group
        - In a medical dataset, you may be required to filter pregnant women who have anemia, and compare this subset to women who don’t have anemia.
        - In a travel dataset, you may be required to filter hotels inside Lahore city, sorted by their minimum per day cost
        - In a client dataset, you may be required to filter the clients who use a Gmail account (may require a string filter)
        - In a client dataset, you may be required to filter the clients who belong to specific countries (may require use of the .isin() function)

## Learning agenda of this notebook

1. How to select rows of a dataframe?
    - First read a dataset and get some insights about the data
    - Select first/last 'N' rows based on their position index
    - Select rows in a particular range using slice object
2. How to select columns of a dataframe?
    - Select single column from a dataframe
    - Select multiple columns from a dataframe
    - Get the subset based on a value of a column
3. Understanding index of a dataframe
    - What is an index?
    - Can we change the index?
    - Will the index be always numeric?
    - Can we reset the index?
4. Use of loc and iloc attributes
    - Creating a basic dataframe from scratch
    - Slicing a dataframe having positional index
    - Slicing a dataframe having categorical variable as index
    - Understanding label of loc and index of iloc
    - Using list of  labels and integers with loc and iloc respectively
    - Subsetting specific rows with specific columns with loc and iloc
    - Selecting rows based on a condition
    - Selecting rows based on multiple conditions
    - Conditional selection and viewing specific columns
5. Selecting columns  with specific data types?
6. Practice session on filtering data
  
**`DATA SET:`** This notebook uses the StudentsPerformance.csv and big_mart_sales.csv datasets, and also creates dataframes from scratch to simplify some concepts

## 1. How to select rows of a dataframe

### a. First read a csv file and get some insights about the data

In [1]:
# import the pandas library
import pandas as pd

# read 'datasets/sample2.csv' file
df = pd.read_csv('datasets/sample2.csv')
df

Unnamed: 0,rollno,gender,group,age,math,english,urdu
0,MS01,female,group B,28.0,72.0,72,74.0
1,MS02,female,group C,33.0,69.0,90,88.0
2,MS03,female,group B,21.0,,95,93.0
3,MS04,male,group A,44.0,47.0,57,44.0
4,MS05,male,group C,54.0,76.0,78,
5,MS06,female,group B,,71.0,83,78.0
6,MS07,female,group B,47.0,88.0,95,92.0
7,MS08,male,group B,33.0,40.0,43,39.0
8,MS09,male,group D,27.0,64.0,64,67.0
9,MS10,female,group B,33.0,38.0,60,50.0


In [2]:
# Shape of dataframe
df.shape

(50, 7)

In [3]:
# Describe the index
df.index

RangeIndex(start=0, stop=50, step=1)

In [4]:
# Column names
df.columns

Index(['rollno', 'gender', 'group', 'age', 'math', 'english', 'urdu'], dtype='object')

In [5]:
# To check the number of non-NA values
df.count()

rollno     50
gender     50
group      50
age        47
math       47
english    50
urdu       47
dtype: int64

In [6]:
# Data Types of each column
df.dtypes

rollno      object
gender      object
group       object
age        float64
math       float64
english      int64
urdu       float64
dtype: object

In [7]:
#This method prints information about a DataFrame including the index dtype, total columns, non-null values and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   rollno   50 non-null     object 
 1   gender   50 non-null     object 
 2   group    50 non-null     object 
 3   age      47 non-null     float64
 4   math     47 non-null     float64
 5   english  50 non-null     int64  
 6   urdu     47 non-null     float64
dtypes: float64(3), int64(1), object(3)
memory usage: 2.9+ KB


### b. Select first/last 'N' rows based on their position index?

In [8]:
# head(n) returns the first n rows for the object based on position.  Default value of n is 5
# For negative values of n, this function returns all rows except the last `n` rows, equivalent to df[:-n].
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
df.head(7)

Unnamed: 0,rollno,gender,group,age,math,english,urdu
0,MS01,female,group B,28.0,72.0,72,74.0
1,MS02,female,group C,33.0,69.0,90,88.0
2,MS03,female,group B,21.0,,95,93.0
3,MS04,male,group A,44.0,47.0,57,44.0
4,MS05,male,group C,54.0,76.0,78,
5,MS06,female,group B,,71.0,83,78.0
6,MS07,female,group B,47.0,88.0,95,92.0


In [9]:
# tail(n) function returns last n rows from the object based on position. Default value of n is 5
# It is useful for quickly verifying data, for example, after sorting or appending rows.
# For negative values of `n`, this function returns all rows except the first `n` rows, equivalent to df[n:]
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
df.tail(3)

Unnamed: 0,rollno,gender,group,age,math,english,urdu
47,MS48,female,group C,30.0,66.0,71,76.0
48,MS49,female,group D,40.0,57.0,74,76.0
49,MS50,male,group C,37.0,66.0,78,81.0
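
The comments above mention negative values of `n`. A minimal sketch of that behaviour follows, assuming a small hypothetical six-row dataframe named `demo` (it is not part of the lecture dataset):

```python
import pandas as pd

# Hypothetical six-row frame, used only for illustration
demo = pd.DataFrame({'rollno': ['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'],
                     'math': [72, 69, 47, 76, 71, 88]})

first_two = demo.head(2)            # first 2 rows
all_but_last_two = demo.head(-2)    # same rows as demo[:-2]
last_two = demo.tail(2)             # last 2 rows
all_but_first_two = demo.tail(-2)   # same rows as demo[2:]

print(all_but_last_two)
print(all_but_first_two)
```

A negative `n` is handy when you know how many trailing (or leading) rows to discard but not how many rows the dataframe has.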


### c. Select rows in a particular range using slice object?

In [2]:
#Slice object: [start:stop:step] is an object that contains a portion of a sequence.
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')

#df[8:]      #If only start is given, it generates a portion of sequence from start index till the last element

#df[:8]      #If only stop is given, it generates a portion of sequence from index 0 to stop, where stop is excluded.

#df[5:10]     #If both start and stop are given, it generates a portion of sequence from start index till the stop where the stop is excluded.

df[5:10:2]    #If start, stop, and step are given, it generates a portion of the sequence from index start up to stop (excluded), in increments of step

#Note: Inside df[], a single integer or a list of integers is looked up as column names, so you cannot select rows that way as you can with a series

Unnamed: 0,rollno,gender,group,age,math,english,urdu
5,MS06,female,group B,,71.0,83,78.0
7,MS08,male,group B,33.0,40.0,43,39.0
9,MS10,female,group B,33.0,38.0,60,50.0
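
The note about single indices can be checked with a short sketch. The dataframe `demo` below is a hypothetical stand-in, and `iloc` (covered later in this notebook) is shown only as one possible way to fetch rows by position:

```python
import pandas as pd

# Hypothetical four-row frame, used only for illustration
demo = pd.DataFrame({'rollno': ['MS01', 'MS02', 'MS03', 'MS04'],
                     'math': [72, 69, 47, 76]})

rows = demo[1:3]                 # a slice inside [] selects rows by position

try:
    demo[1]                      # a single integer is looked up as a column name
    lookup_failed = False
except KeyError:
    lookup_failed = True         # there is no column named 1

one_row = demo.iloc[1]           # iloc accepts a single position
some_rows = demo.iloc[[0, 3]]    # and a list of positions

print(rows)
print(lookup_failed)
```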


## 2. How to select columns of a dataframe?

### a. Select a single column from a dataframe

In [11]:
# We can select a single column by placing its name inside single brackets. 
# The result is a Series object with its own set of attributes and methods. 
# These objects are like arrays and are the building blocks of DataFrames, each DataFrame is made up of a set of Series.

import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
s1 = df['gender']
print(s1)
type(s1)

0     female
1     female
2     female
3       male
4       male
5     female
6     female
7       male
8       male
9     female
10      male
11      male
12    female
13      male
14    female
15    female
16      male
17    female
18      male
19    female
20      male
21    female
22      male
23    female
24      male
25      male
26      male
27    female
28      male
29    female
30    female
31    female
32    female
33      male
34      male
35      male
36    female
37    female
38    female
39      male
40      male
41    female
42    female
43      male
44    female
45      male
46    female
47    female
48    female
49      male
Name: gender, dtype: object


pandas.core.series.Series

### b. Select multiple columns from a dataframe

In [3]:
#To select multiple columns at once, we use double brackets and commas between column names
#The result is a new DataFrame object with the selected columns. 
d1 = df[['gender', 'math']].head()
print(d1)
type(d1)

Unnamed: 0,gender,math
0,female,72.0
1,female,69.0
2,female,
3,male,47.0
4,male,76.0


### c. Get the subset based on a value of a column

In [13]:
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
df['math'] < 50       # returns a Boolean Series (True/False per row) that we can use to index the dataframe
df[df['math'] < 50]   # will return records where the condition is true

Unnamed: 0,rollno,gender,group,age,math,english,urdu
3,MS04,male,group A,44.0,47.0,57,44.0
7,MS08,male,group B,33.0,40.0,43,39.0
9,MS10,female,group B,33.0,38.0,60,50.0
11,MS12,male,group D,53.0,40.0,52,43.0
17,MS18,female,group B,31.0,18.0,32,28.0
18,MS19,male,group C,33.0,46.0,42,46.0
22,MS23,male,group D,31.0,44.0,54,53.0
33,MS34,male,group D,34.0,40.0,42,38.0


## 3. Understanding index of dataframe

In [14]:
# Let us see the index object of this dataframe
import pandas as pd

df = pd.read_csv('datasets/sample2.csv')
df.index
df


Unnamed: 0,rollno,gender,group,age,math,english,urdu
0,MS01,female,group B,28.0,72.0,72,74.0
1,MS02,female,group C,33.0,69.0,90,88.0
2,MS03,female,group B,21.0,,95,93.0
3,MS04,male,group A,44.0,47.0,57,44.0
4,MS05,male,group C,54.0,76.0,78,
5,MS06,female,group B,,71.0,83,78.0
6,MS07,female,group B,47.0,88.0,95,92.0
7,MS08,male,group B,33.0,40.0,43,39.0
8,MS09,male,group D,27.0,64.0,64,67.0
9,MS10,female,group B,33.0,38.0,60,50.0


### a. Change index of a dataframe to some random values

In [15]:
import pandas as pd
import random
df = pd.read_csv('datasets/sample2.csv')
random_list = [random.randint(0, 49) for i in range(50)] 

# Now let us set this list as the index of our data frame
df.index = random_list
df.head()

Unnamed: 0,rollno,gender,group,age,math,english,urdu
47,MS01,female,group B,28.0,72.0,72,74.0
37,MS02,female,group C,33.0,69.0,90,88.0
36,MS03,female,group B,21.0,,95,93.0
45,MS04,male,group A,44.0,47.0,57,44.0
24,MS05,male,group C,54.0,76.0,78,
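
Note that `random.randint` may produce the same number more than once, so an index built this way can contain duplicate labels. The sketch below, on a hypothetical five-row dataframe `demo`, shows one way to get distinct labels with `random.sample`, and what `loc` returns when labels repeat:

```python
import random
import pandas as pd

# Hypothetical five-row frame, used only for illustration
demo = pd.DataFrame({'name': ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana']})

# random.sample draws without replacement, so every label is distinct
unique_labels = random.sample(range(len(demo)), k=len(demo))
demo.index = unique_labels
print(demo.index.is_unique)

# With repeated labels, loc returns every row that carries the label
dup = demo.copy()
dup.index = [0, 1, 1, 2, 2]
print(dup.loc[1])
```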


### b. Set another column of the dataframe as the index
- We can have both numerical as well as categorical variables as index of a dataframe
- We use set_index() function to change index of a dataframe to some other column
```
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```
Where
    - keys is the column label (or a list of labels) to use as the new index
    - drop=True drops the column that is set as the new index
    - append=False; set it to True if you want to append the columns to the existing index
    - inplace=False; set it to True to make changes in the original dataframe, i.e., do not create a new object
    - verify_integrity=False; set it to True to check the new index for duplicates immediately. Otherwise the check is deferred until necessary, which improves the performance of this method.

Returns a DataFrame if inplace=False, or None if inplace=True


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
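
The parameters listed above can be tried on a small example. This is a minimal sketch, assuming two hypothetical dataframes `demo` and `dup_frame` that are not part of the lecture data:

```python
import pandas as pd

# Hypothetical frames, used only for illustration
demo = pd.DataFrame({'id': ['A101', 'A102', 'A103'],
                     'name': ['Kamal', 'Saima', 'Jamal']})
dup_frame = pd.DataFrame({'id': ['A101', 'A101'],
                          'name': ['Kamal', 'Saima']})

dropped = demo.set_index('id')             # 'id' moves into the index
kept = demo.set_index('id', drop=False)    # 'id' stays as a column too

# verify_integrity=True raises ValueError when the new index has duplicates
try:
    dup_frame.set_index('id', verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(dropped)
print(kept)
print(raised)
```

Because `inplace` is left at its default of False, `demo` itself is unchanged and each call returns a new dataframe.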



In [16]:
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
df.set_index('rollno', drop=True, inplace=True)
df.head()

rollno,gender,group,age,math,english,urdu
MS01,female,group B,28.0,72.0,72,74.0
MS02,female,group C,33.0,69.0,90,88.0
MS03,female,group B,21.0,,95,93.0
MS04,male,group A,44.0,47.0,57,44.0
MS05,male,group C,54.0,76.0,78,


In [17]:
df.index

Index(['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06', 'MS07', 'MS08', 'MS09',
       'MS10', 'MS11', 'MS12', 'MS13', 'MS14', 'MS15', 'MS16', 'MS17', 'MS18',
       'MS19', 'MS20', 'MS21', 'MS22', 'MS23', 'MS24', 'MS25', 'MS26', 'MS27',
       'MS28', 'MS29', 'MS30', 'MS31', 'MS32', 'MS33', 'MS34', 'MS35', 'MS36',
       'MS37', 'MS38', 'MS39', 'MS40', 'MS41', 'MS42', 'MS43', 'MS44', 'MS45',
       'MS46', 'MS47', 'MS48', 'MS49', 'MS50'],
      dtype='object', name='rollno')

### c. Change the index back to positional?
- Use reset_index() to reset the index

In [18]:
# reset the index
df.reset_index(inplace=True)

In [19]:
# view the top rows of the data
df.head()

Unnamed: 0,rollno,gender,group,age,math,english,urdu
0,MS01,female,group B,28.0,72.0,72,74.0
1,MS02,female,group C,33.0,69.0,90,88.0
2,MS03,female,group B,21.0,,95,93.0
3,MS04,male,group A,44.0,47.0,57,44.0
4,MS05,male,group C,54.0,76.0,78,
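
When the old index is itself positional, for example after sorting, `reset_index()` stores the old labels in a new column named `index`. A minimal sketch on a hypothetical dataframe `demo` shows how `drop=True` discards them instead:

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
demo = pd.DataFrame({'name': ['Kamal', 'Saima', 'Jamal'],
                     'marks': [22, 21, 12]})
by_marks = demo.sort_values(by='marks')     # index labels become 2, 1, 0

kept_old = by_marks.reset_index()           # old labels saved in a column named 'index'
fresh = by_marks.reset_index(drop=True)     # old labels discarded

print(kept_old)
print(fresh)
```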


## 4. Use of loc and iloc attribute
- Although we can access rows of a dataframe using the `df[]` syntax, `df.loc[]` and `df.iloc[]` provide a simpler syntax than `df[]`
- **loc** is used to select rows of a dataframe by **index label**
- **loc** can use following within the [ ]
    - place a single label (e.g., 5 or 'a'), interpreted as label of the index 
    - slice object with labels (contrary to usual Python slices, both the start and stop are included) 
    - list or array of labels ['A9', 'A2', 'A7'] or [9, 2, 7]
    - a Boolean array (any NA values will be treated as False)
    - A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).



- **iloc** is used to select rows of a dataframe by **position**
- **iloc** can use following within the [ ]
    - place a single integer (e.g., 5), interpreted as row# (positional index) of the dataframe 
    - slice object with integers (stop is NOT inclusive) 
    - list or array of integers [9, 2, 7]
    - a Boolean array (any NA values will be treated as False)
    - A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
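
The Boolean-array and callable forms in the two lists above are not used in the cells that follow, so here is a minimal sketch of both. It assumes a hypothetical three-row dataframe `demo` with string labels:

```python
import pandas as pd

# Hypothetical frame with string index labels
demo = pd.DataFrame({'name': ['Kamal', 'Saima', 'Jamal'],
                     'marks': [22, 21, 12]},
                    index=['A101', 'A102', 'A103'])

by_mask = demo.loc[[True, False, True]]             # Boolean list, one entry per row
by_callable = demo.loc[lambda d: d['marks'] > 20]   # the callable receives demo
by_pos_mask = demo.iloc[[True, False, True]]        # iloc accepts a Boolean list too
by_pos_callable = demo.iloc[lambda d: [0, 2]]       # the callable returns positions

print(by_mask)
print(by_callable)
```

The callable form is mostly useful in method chains, where the intermediate dataframe has no name of its own.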

### a. Creating a basic dataframe from scratch

In [20]:
# Let us create a simple data frame to have a clear understanding about the working of loc and iloc 
# as well as the difference between the two.
import pandas as pd
sample_df = pd.DataFrame({
    'name' : ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'gender' : ['M', 'F', 'M', 'M', 'F'],
    'grade'  : ['A', 'A', 'B', 'B', 'A'],
    'marks'  : [ 22,  21,  12,  14,  20],
    'id'     : ['A101', 'A102', 'A103', 'A104', 'A105'],
    'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']
})
sample_df

Unnamed: 0,name,gender,grade,marks,id,city
0,Kamal,M,A,22,A101,Lahore
1,Saima,F,A,21,A102,Peshawer
2,Jamal,M,B,12,A103,Lahore
3,Shaikh,M,B,14,A104,Karachi
4,Farzana,F,A,20,A105,Peshawer


### b. Slicing with loc and iloc in a dataframe having positional index

In [21]:
# Now, when we define a dataframe by default the index is a range of numbers. 
# Let's see what happens if try to slice the dataframe using both `loc` and `iloc`
sample_df.loc[2:4]   #both start and stop are inclusive

Unnamed: 0,name,gender,grade,marks,id,city
2,Jamal,M,B,12,A103,Lahore
3,Shaikh,M,B,14,A104,Karachi
4,Farzana,F,A,20,A105,Peshawer


In [22]:
sample_df.iloc[2:4] # start is inclusive, however, stop is NOT inclusive

Unnamed: 0,name,gender,grade,marks,id,city
2,Jamal,M,B,12,A103,Lahore
3,Shaikh,M,B,14,A104,Karachi


In [23]:
# Let us sort the data by marks, note the indices are shuffled accordingly
sample_df = sample_df.sort_values(by=['marks'])
sample_df

Unnamed: 0,name,gender,grade,marks,id,city
2,Jamal,M,B,12,A103,Lahore
3,Shaikh,M,B,14,A104,Karachi
4,Farzana,F,A,20,A105,Peshawer
1,Saima,F,A,21,A102,Peshawer
0,Kamal,M,A,22,A101,Lahore


In [24]:
# Let's try to slice the dataframes again using loc, by placing the label
# Note when we slice using loc, the value given is interpreted as label of the index and 
# contrary to Python both start and stop are inclusive
sample_df.loc[2:4]

Unnamed: 0,name,gender,grade,marks,id,city
2,Jamal,M,B,12,A103,Lahore
3,Shaikh,M,B,14,A104,Karachi
4,Farzana,F,A,20,A105,Peshawer


In [25]:
# Let's try to slice the dataframes again using iloc, by placing integer values
# Note when we slice using iloc, the value given is interpreted as a row position (not an index label) and 
# start is inclusive, while stop is NOT inclusive
sample_df.iloc[2:4]

Unnamed: 0,name,gender,grade,marks,id,city
4,Farzana,F,A,20,A105,Peshawer
1,Saima,F,A,21,A102,Peshawer


### c. Slicing with loc and iloc in a dataframe having categorical variable as index

In [26]:
# Let's change the index of the dataframe to a categorical variable (instead of integer)
sample_df.set_index('id',inplace=True)
sample_df

id,name,gender,grade,marks,city
A103,Jamal,M,B,12,Lahore
A104,Shaikh,M,B,14,Karachi
A105,Farzana,F,A,20,Peshawer
A102,Saima,F,A,21,Peshawer
A101,Kamal,M,A,22,Lahore


In [27]:
#Now, try to slice the dataframe using the loc and iloc function
# iloc still works fine as it uses the integer values as row/record number inside the dataframe
sample_df.iloc[2:4]

id,name,gender,grade,marks,city
A105,Farzana,F,A,20,Peshawer
A102,Saima,F,A,21,Peshawer


In [28]:
#Now loc raises an error, because the index has no labels 2 and 4
# sample_df.loc[2:4]

# works fine, because these are the index/labels that exist in the dataframe 
# do note the sequence of start and stop labels
sample_df.loc['A104':'A102'] 

id,name,gender,grade,marks,city
A104,Shaikh,M,B,14,Karachi
A105,Farzana,F,A,20,Peshawer
A102,Saima,F,A,21,Peshawer
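
The order of the start and stop labels matters because `loc` slices follow the stored row order, not the alphabetical order of the labels. A minimal sketch, assuming a hypothetical dataframe `demo` with the same label order as above:

```python
import pandas as pd

# Hypothetical frame whose labels are not sorted
demo = pd.DataFrame({'name': ['Jamal', 'Shaikh', 'Farzana', 'Saima', 'Kamal']},
                    index=['A103', 'A104', 'A105', 'A102', 'A101'])

forward = demo.loc['A104':'A102']    # A104, A105, A102 in stored order
backward = demo.loc['A102':'A104']   # start sits after stop, so nothing is selected

try:
    demo.loc['A100':'A102']          # 'A100' is absent and the index is unsorted
    missing_label_ok = True
except KeyError:
    missing_label_ok = False

print(forward)
print(len(backward))
print(missing_label_ok)
```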


### d. Understanding label of loc and index of iloc

In [29]:
# Let us now again reset the index to positional
sample_df.reset_index(inplace=True)
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
0,A103,Jamal,M,B,12,Lahore
1,A104,Shaikh,M,B,14,Karachi
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer
4,A101,Kamal,M,A,22,Lahore


In [30]:
# Let us sort by column city
sample_df = sample_df.sort_values(by=['city'])
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer


In [31]:
# Use of loc will return the series containing record/row related with label value 3
s1 = sample_df.loc[3]    
s1


id            A102
name         Saima
gender           F
grade            A
marks           21
city      Peshawer
Name: 3, dtype: object

In [32]:
# Use of iloc will return the series containing record/row at positional index 3
s1 = sample_df.iloc[3]    
s1

id            A105
name       Farzana
gender           F
grade            A
marks           20
city      Peshawer
Name: 2, dtype: object

### e. Using list of  labels and integers with loc and iloc respectively

In [33]:
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer


In [34]:
# With loc we can use list of labels, and it returns a dataframe related with the list of labels
d1 = sample_df.loc[[3,0]]     
d1

Unnamed: 0,id,name,gender,grade,marks,city
3,A102,Saima,F,A,21,Peshawer
0,A103,Jamal,M,B,12,Lahore


In [35]:
# With iloc we can use a list or array of integers, and it returns a dataframe related with the list of integers
# Note that the integer values are interpreted as row# (positional index) of the dataframe 
d1 = sample_df.iloc[[3,0]]     
d1

Unnamed: 0,id,name,gender,grade,marks,city
2,A105,Farzana,F,A,20,Peshawer
1,A104,Shaikh,M,B,14,Karachi


### f. Subsetting specific rows with specific columns with loc and iloc

In [36]:
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer


In [37]:
# Apart from rows you can also get specific columns using iloc
# Suppose you want the rows at positions 3 and 0, and only the columns at positions 1 and 5 rather than all the columns

d1 = sample_df.iloc[[3, 0], [1, 5]]
d1

Unnamed: 0,name,city
2,Farzana,Peshawer
1,Shaikh,Karachi


In [38]:
# Apart from rows you can also get specific columns using loc
# Suppose you want the rows with index labels 3 and 0, and only the columns name and city rather than all the columns

d1 = sample_df.loc[[3, 0], ['name', 'city']]
d1

Unnamed: 0,name,city
3,Saima,Peshawer
0,Jamal,Lahore


### g. Selecting rows based on a condition



In [39]:
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer


In [40]:
# Selection based on single condition
d1 = sample_df.loc[sample_df.city == 'Lahore']
d1

Unnamed: 0,id,name,gender,grade,marks,city
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore


### h. Selecting rows based on multiple conditions
- When passing multiple conditions, make sure that you put each condition in parentheses () and combine them using the & (and), | (or), and ~ (not) operators

In [41]:
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer


In [42]:
# Select the records of students who belong to Lahore and have marks>15
sample_df.loc[(sample_df.city == 'Lahore') & (sample_df.marks > 15)]

Unnamed: 0,id,name,gender,grade,marks,city
4,A101,Kamal,M,A,22,Lahore


In [43]:
# Select the records of students who belong to Lahore or Karachi
sample_df.loc[(sample_df.city == 'Lahore') | (sample_df.city == 'Karachi')]

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore


In [44]:
# A better way to filter rows in the case above is the isin() function, which checks 
# whether each element of a Series is contained in `values`.
# isin() returns a Boolean Series showing whether each element in the Series exactly matches an element in the passed sequence of `values`.

sample_df[sample_df.city.isin(['Lahore', 'Karachi'])]

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore


### i. Conditional selection and viewing specific columns

In [45]:
sample_df

Unnamed: 0,id,name,gender,grade,marks,city
1,A104,Shaikh,M,B,14,Karachi
0,A103,Jamal,M,B,12,Lahore
4,A101,Kamal,M,A,22,Lahore
2,A105,Farzana,F,A,20,Peshawer
3,A102,Saima,F,A,21,Peshawer


In [46]:
cols = ['name', 'marks', 'city']
d1 = sample_df.loc[(sample_df.city == 'Lahore') & (sample_df.marks > 10)]
d1
d2 = sample_df.loc[(sample_df.city == 'Lahore') & (sample_df.marks > 10), cols]
d2

Unnamed: 0,name,marks,city
0,Jamal,12,Lahore
4,Kamal,22,Lahore


## 5. Selecting columns of a specific data type
- We can use the `select_dtypes(include=None, exclude=None)` method of a dataframe
- It returns the subset of the dataframe's columns whose dtypes are listed in include and not listed in exclude
- include and exclude can be scalar or list-like
- At least one of these parameters must be supplied

In [47]:
# Let us load the dataset for this exercise
import pandas as pd
import random
df = pd.read_csv('datasets/sample2.csv')
df

Unnamed: 0,rollno,gender,group,age,math,english,urdu
0,MS01,female,group B,28.0,72.0,72,74.0
1,MS02,female,group C,33.0,69.0,90,88.0
2,MS03,female,group B,21.0,,95,93.0
3,MS04,male,group A,44.0,47.0,57,44.0
4,MS05,male,group C,54.0,76.0,78,
5,MS06,female,group B,,71.0,83,78.0
6,MS07,female,group B,47.0,88.0,95,92.0
7,MS08,male,group B,33.0,40.0,43,39.0
8,MS09,male,group D,27.0,64.0,64,67.0
9,MS10,female,group B,33.0,38.0,60,50.0


In [48]:
# Let us first check the data types of each column
df.dtypes

rollno      object
gender      object
group       object
age        float64
math       float64
english      int64
urdu       float64
dtype: object

In [49]:
# Select the columns with object data type (categorical variables) only
df.select_dtypes(include='object').head()

Unnamed: 0,rollno,gender,group
0,MS01,female,group B
1,MS02,female,group C
2,MS03,female,group B
3,MS04,male,group A
4,MS05,male,group C


In [50]:
# Select the columns with float64 datatype
df.select_dtypes(include='float64').head()

Unnamed: 0,age,math,urdu
0,28.0,72.0,74.0
1,33.0,69.0,88.0
2,21.0,,93.0
3,44.0,47.0,44.0
4,54.0,76.0,


In [51]:
# Select the columns with int64 datatype
df.select_dtypes(include='int64').head()

Unnamed: 0,english
0,72
1,90
2,95
3,57
4,78
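
The cells above pass a single dtype to `include`. As a minimal sketch of the other options, the example below uses a hypothetical two-row dataframe `demo` to show the generic `'number'` selector, a list-like `include`, and `exclude`:

```python
import pandas as pd

# Hypothetical frame with one string, one float, and one integer column
demo = pd.DataFrame({'rollno': ['MS01', 'MS02'],
                     'age': [28.0, 33.0],
                     'english': [72, 90]})

numeric = demo.select_dtypes(include='number')             # integer and float columns
both = demo.select_dtypes(include=['int64', 'float64'])    # list-like include
non_float = demo.select_dtypes(exclude='float64')          # everything except float64

print(list(numeric.columns))
print(list(non_float.columns))
```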


## 6. Practice Session on Filtering Data 

In [52]:
# Let us create a simple dataframe of artists to practice filtering rows 
# with the [ ] operator and with loc.
import pandas as pd
df = pd.DataFrame({
    'artist' : ['Atif Aslam', 'Nusrat Fateh Ali', 'Ali Zaffar', 'Nazia Hassan', 'Abida Parveen', 'Rahat Fateh Ali', 'Hadiqa Kiani'],
    'city' : ['Lahore', 'Karachi', 'Islamabad', 'Lahore', 'Peshawer', 'Quetta', 'Karachi'],
    'album_count'  : [23, 31, 42, 38, 41, 36, 25],
    'genre'  : ['Rock', 'Folk', 'Rock', 'Disco', 'Folk', 'Classical', 'Jaaz']
})
df


Unnamed: 0,artist,city,album_count,genre
0,Atif Aslam,Lahore,23,Rock
1,Nusrat Fateh Ali,Karachi,31,Folk
2,Ali Zaffar,Islamabad,42,Rock
3,Nazia Hassan,Lahore,38,Disco
4,Abida Parveen,Peshawer,41,Folk
5,Rahat Fateh Ali,Quetta,36,Classical
6,Hadiqa Kiani,Karachi,25,Jaaz


In [53]:
# Filter out records of musicians who play Folk music and have an album count >=10

out = df[(df['genre']=='Folk') & (df['album_count']>=10)] # use [ ] operator

out = df.loc[(df.genre=='Folk') & (df.album_count>=10)]   # use .loc[ ]
out


Unnamed: 0,artist,city,album_count,genre
1,Nusrat Fateh Ali,Karachi,31,Folk
4,Abida Parveen,Peshawer,41,Folk


In [54]:
# Filter out records of musicians who do not belong to Karachi

out = df.loc[df.city !='Karachi']   
out


Unnamed: 0,artist,city,album_count,genre
0,Atif Aslam,Lahore,23,Rock
2,Ali Zaffar,Islamabad,42,Rock
3,Nazia Hassan,Lahore,38,Disco
4,Abida Parveen,Peshawer,41,Folk
5,Rahat Fateh Ali,Quetta,36,Classical


In [55]:
# Filter out records of artists based on two conditions:
# either lives outside Karachi and has album_count > 30, or lives in Lahore
out = df.loc[((df.city != 'Karachi') & (df.album_count > 30)) | ((df.city == 'Lahore'))]
out



Unnamed: 0,artist,city,album_count,genre
0,Atif Aslam,Lahore,23,Rock
2,Ali Zaffar,Islamabad,42,Rock
3,Nazia Hassan,Lahore,38,Disco
4,Abida Parveen,Peshawer,41,Folk
5,Rahat Fateh Ali,Quetta,36,Classical


In [56]:
# Filter out records of Artists whose name has string "Ali"

out = df.loc[df.artist.str.contains('Ali')]
out

Unnamed: 0,artist,city,album_count,genre
1,Nusrat Fateh Ali,Karachi,31,Folk
2,Ali Zaffar,Islamabad,42,Rock
5,Rahat Fateh Ali,Quetta,36,Classical


In [57]:
out = df.loc[df.artist.str.startswith('Ali')]
out

Unnamed: 0,artist,city,album_count,genre
2,Ali Zaffar,Islamabad,42,Rock
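
String filters are case-sensitive by default, and a column with missing names can yield a mask with missing entries in some pandas versions. A minimal sketch of the `case` and `na` parameters of `str.contains`, on a hypothetical dataframe `demo`:

```python
import pandas as pd

# Hypothetical frame with a missing name and a lower-case name
demo = pd.DataFrame({'artist': ['Atif Aslam', 'Ali Zaffar', None, 'rahat fateh ali']})

# case=False ignores letter case; na=False treats missing names as non-matches
mask = demo.artist.str.contains('ali', case=False, na=False)
matches = demo.loc[mask]

print(mask)
print(matches)
```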


# Misc Mathematical Functions

In [58]:
import pandas as pd
data = {'name': ['Rauf', 'Hadeed', 'Maaz', 'Mujahid', 'Arif'], 
        'age': [42, 52, 36, 24, 73], 
        'subj1': [4, 24, 31, 2, 3],
        'subj2': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'subj1', 'subj2'])
df

Unnamed: 0,name,age,subj1,subj2
0,Rauf,42,4,25
1,Hadeed,52,24,94
2,Maaz,36,31,57
3,Mujahid,24,2,62
4,Arif,73,3,70


In [59]:
df['subj1'].max()

31

In [60]:
df['subj1'].min()

2

In [61]:

df['subj1'].sum()

64

In [62]:
df['subj1'].cumsum()

0     4
1    28
2    59
3    61
4    64
Name: subj1, dtype: int64

In [63]:
df['subj1'].mean()

12.8

In [64]:
df['subj1'].median()

4.0

In [65]:
df['subj1'].std()

13.663820841916802

In [66]:
df['subj1'].mode()

0     2
1     3
2     4
3    24
4    31
dtype: int64
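
`mode()` returns every value above because each value in `subj1` occurs exactly once, so all of them tie for most frequent. A minimal sketch comparing this with a hypothetical Series `repeated` that has a clear mode:

```python
import pandas as pd

unique_values = pd.Series([4, 24, 31, 2, 3])     # same values as subj1
repeated = pd.Series([4, 24, 4, 2, 24, 4])       # hypothetical data with repeats

print(unique_values.mode().tolist())   # every value ties with a count of 1
print(repeated.mode().tolist())        # 4 occurs most often
```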

In [67]:
#Summary statistics
df.describe()

Unnamed: 0,age,subj1,subj2
count,5.0,5.0,5.0
mean,45.4,12.8,61.6
std,18.460769,13.663821,24.905823
min,24.0,2.0,25.0
25%,36.0,3.0,57.0
50%,42.0,4.0,62.0
75%,52.0,24.0,70.0
max,73.0,31.0,94.0


In [68]:
#Count the number of non-NA values
df['subj1'].count()

5

In [69]:
#Sample variance of subj1 values
df['subj1'].var()

186.7

In [70]:
#Sample standard deviation of subj1 values
df['subj1'].std()

13.663820841916802

In [71]:
#Skewness of subj1 values
df['subj1'].skew()


0.7433452457326751

In [72]:
#Kurtosis of subj1 values
df['subj1'].kurt()

-2.4673543738411547

In [73]:
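#Correlation Matrix Of Values
#numeric_only=True skips the string column 'name'; pandas 2.0 and later raise an error without it
df.corr(numeric_only=True)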

Unnamed: 0,age,subj1,subj2
age,1.0,-0.105651,0.328852
subj1,-0.105651,1.0,0.378039
subj2,0.328852,0.378039,1.0


In [74]:
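#Covariance Matrix Of Values
#numeric_only=True skips the string column 'name'; pandas 2.0 and later raise an error without it
df.cov(numeric_only=True)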

Unnamed: 0,age,subj1,subj2
age,340.8,-26.65,151.2
subj1,-26.65,186.7,128.65
subj2,151.2,128.65,620.3
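
The two matrices are related: the Pearson correlation of two columns is their covariance divided by the product of their sample standard deviations. A minimal sketch that checks this for `age` and `subj2`, rebuilding the numeric columns of the dataframe above:

```python
import pandas as pd

# Numeric columns of the dataframe used in this section
df = pd.DataFrame({'age': [42, 52, 36, 24, 73],
                   'subj1': [4, 24, 31, 2, 3],
                   'subj2': [25, 94, 57, 62, 70]})

cov_age_subj2 = df['age'].cov(df['subj2'])      # sample covariance
corr_manual = cov_age_subj2 / (df['age'].std() * df['subj2'].std())
corr_pandas = df['age'].corr(df['subj2'])       # Pearson correlation by default

print(cov_age_subj2)
print(corr_manual)
print(corr_pandas)
```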
