# Filtering a data frame (important topic)

Filtering uses __conditional statements__ to narrow your search. For example, if you need a jacket with a rating of 4 stars or above, you can apply a filter that shows only the jackets rated 4 stars or above.

**A filter tests the values in one column and keeps or drops whole rows based on the result.**

Since filtering uses a conditional statement (one built from operators such as `<`, `>`, `==`, `!=`, combined with AND, OR and NOT), the result contains logical values (True and False), one per row.

```python
stats.InternetUsers.head() < 2
# output
0    False
1    False
2    False
3    False
4    False
Name: InternetUsers, dtype: bool
```
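
A minimal sketch of the same idea on a tiny frame, so it runs without `demographic.csv`. The five rows are copied from outputs shown in these notes; the names `sample` and `mask` are only illustrative. It shows that the comparison returns a boolean Series, and that because `True` counts as 1, `mask.sum()` gives the number of rows that match.

```python
import pandas as pd

# Small stand-in for the full data set (values taken from the notes)
sample = pd.DataFrame({
    "CountryName": ["Aruba", "Afghanistan", "Angola", "Burundi", "Niger"],
    "InternetUsers": [78.9, 5.9, 19.1, 1.3, 1.7],
})

mask = sample.InternetUsers < 2   # one True/False value per row
print(mask.dtype)                 # bool
print(mask.sum())                 # 2 rows satisfy the condition
```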

## Benefit of filtering
- **Allows us to create subsets that contain only the rows where the condition is True.** Rows where it is False are not part of the subset.

```python
stats[stats.InternetUsers < 2]
```
__Output__

| Index | CountryName | CountryCode | BirthRate | InternetUsers | IncomeGroup |
|:-------|:--------------|:-------------|:-------:|:------------:|:--------|
| 11 |	Burundi |	BDI |	44.151 | 1.3 |	Low income |
| 52 |	Eritrea |	ERI	 |34.80 | 0.9	 |Low income |
| 55 |	Ethiopia |	ETH |	32.925 | 1.9 |	Low income |
| 64 |	Guinea |	GIN |	37.337 |	1.6	| Low income |
| 117 |	Myanmar |	MMR |	18.119 |	1.6 |	Lower middle income |
| 127 |	Niger |	NER |	49.661 |	1.7 |	Low income |
| 154 |	Sierra Leone |	SLE |	36.729 |	1.7 |	Low income |
| 156 |	Somalia	 | SOM	| 43.891 |	1.5 |	Low income |
| 172 |	Timor-Leste |	TLS	 |35.755 |	1.1	| Lower middle income |


### What does this filter tell us?

- the names of the countries that fall within the filter
- the income group of each of these countries
- possible relationships in the data (for example, lower income appears to go with lower internet usage, and with a higher birth rate)

**Tip:**

Use `dataframe.describe()`.

This gives summary statistics for the numeric columns: count, mean, standard deviation, min, max and the quartiles (the 50% row is the median).

e.g. `describe()` shows that the median birth rate is about 19.7 and the maximum is about 49.7, so a threshold such as 41 picks out only the countries with the highest birth rates. (41 is the median of InternetUsers, not of BirthRate.)

```python
stats[stats.BirthRate > 41]
```

The output contains 9 countries, and almost all of them are on the African continent.
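
Rather than reading a threshold off the `describe()` table by eye, the median can also be computed directly and used in the filter. A small sketch, assuming a five-row sample copied from the outputs in these notes (`sample`, `median_birth_rate` and `above_median` are illustrative names):

```python
import pandas as pd

# Small stand-in for the full data set (values taken from the notes)
sample = pd.DataFrame({
    "CountryName": ["Aruba", "Afghanistan", "Angola", "Burundi", "Niger"],
    "BirthRate": [10.244, 35.253, 45.985, 44.151, 49.661],
})

# Same number as the 50% row of sample.describe()
median_birth_rate = sample.BirthRate.median()

# Keep only the rows strictly above the median
above_median = sample[sample.BirthRate > median_birth_rate]
print(median_birth_rate)                   # 44.151
print(above_median.CountryName.tolist())   # ['Angola', 'Niger']
```

With the real data, replacing `sample` by `stats` would give the countries above the actual median.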


## Filtering using '&' and '|' 

`&` is called ampersand and is used for the AND operation.

`|` is called pipe and is used for the OR operation.

- `&` means both filter conditions need to be True for the result to be True.

- `|` means at least one of the two conditions needs to be True for the result to be True.

### AND operation

```python
stats[(stats.InternetUsers < 2) & (stats.BirthRate > 40)]
```
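__Note__

1. Use parentheses `( )` around each condition. `&` binds more tightly than `>` and `<`, so without parentheses Python first tries to evaluate `40 & stats.InternetUsers`, which fails.

```python
stats[stats.BirthRate > 40 & stats.InternetUsers < 2]
# output
# TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
```

2. Use ampersand (`&`) instead of `and`. `and` tries to reduce each whole Series to a single True or False, which pandas refuses to do.
```python
stats[(stats.InternetUsers < 2) and (stats.BirthRate > 40)]
# output
# ValueError: The truth value of a Series is ambiguous.
```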
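
Both failures can be reproduced safely by catching the exceptions. This is only a sketch on a three-row sample (values copied from the notes; `sample`, `first_error`, `second_error` and `both` are illustrative names). The first `except` accepts either error type because the exact exception may differ between pandas versions.

```python
import pandas as pd

sample = pd.DataFrame({
    "BirthRate": [10.244, 44.151, 49.661],
    "InternetUsers": [78.9, 1.3, 1.7],
})

# Missing parentheses: & is applied before > and <,
# so Python attempts 40 & sample.InternetUsers first.
try:
    sample[sample.BirthRate > 40 & sample.InternetUsers < 2]
except (TypeError, ValueError) as err:
    first_error = type(err).__name__
    print(first_error)

# `and` asks each Series for a single True/False value.
try:
    sample[(sample.InternetUsers < 2) and (sample.BirthRate > 40)]
except ValueError as err:
    second_error = type(err).__name__
    print(second_error)   # ValueError

# Correct form: parentheses around each condition, joined with &
both = sample[(sample.InternetUsers < 2) & (sample.BirthRate > 40)]
print(len(both))          # 2
```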

### OR operation 

```python
stats[(stats.InternetUsers < 2) | (stats.BirthRate > 40)]
```

Again, use parentheses around each condition and use the pipe symbol (`|`) instead of `or`.
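
To see AND and OR side by side, here is a small sketch that stores each condition in its own variable first. The rows are copied from outputs in these notes; the names `low_internet` and `high_birth` are illustrative. It also uses `~`, the element-wise NOT in pandas, which these notes do not otherwise cover.

```python
import pandas as pd

sample = pd.DataFrame({
    "CountryName": ["Aruba", "Afghanistan", "Angola", "Burundi", "Eritrea"],
    "BirthRate": [10.244, 35.253, 45.985, 44.151, 34.8],
    "InternetUsers": [78.9, 5.9, 19.1, 1.3, 0.9],
})

low_internet = sample.InternetUsers < 2   # True for Burundi, Eritrea
high_birth = sample.BirthRate > 40        # True for Angola, Burundi

and_names = sample[low_internet & high_birth].CountryName.tolist()
or_names = sample[low_internet | high_birth].CountryName.tolist()
not_names = sample[~low_internet].CountryName.tolist()

print(and_names)   # ['Burundi']
print(or_names)    # ['Angola', 'Burundi', 'Eritrea']
print(not_names)   # ['Aruba', 'Afghanistan', 'Angola']
```

Naming the conditions keeps the final filter readable and removes the need for nested parentheses.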


## How to know the categorical values of a column

What are categorical values?

Let's say you have a column 'IncomeGroup'. Every value in it falls into one of four categories ('High income', 'Low income', 'Upper middle income', 'Lower middle income').

To find out what those categories are, use `dataframe.column_name.unique()`

```python
stats.IncomeGroup.unique()
#output
array(['High income', 'Low income', 'Upper middle income',
       'Lower middle income'], dtype=object)
```
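
Two related calls can be handy here, although they are not used elsewhere in these notes: `nunique()` returns how many distinct categories there are, and `value_counts()` returns how many rows fall in each one. A short sketch on a made-up five-row sample (`sample`, `n_groups` and `counts` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({
    "CountryName": ["Aruba", "Afghanistan", "Angola", "Burundi", "Malta"],
    "IncomeGroup": ["High income", "Low income", "Upper middle income",
                    "Low income", "High income"],
})

n_groups = sample.IncomeGroup.nunique()     # number of distinct categories
counts = sample.IncomeGroup.value_counts()  # rows per category

print(n_groups)               # 3
print(counts["Low income"])   # 2
```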

In [19]:
import pandas as pd
import os

os.getcwd()

stats = pd.read_csv('demographic.csv')

stats

#Changing column name

stats.columns = ['CountryName', 'CountryCode', 'BirthRate', 'InternetUsers',
       'IncomeGroup']

#print 
stats

#observing data

stats.head()
stats.CountryName.head()
stats.CountryName[4:9]

#getting more than one column
stats[['CountryName', 'CountryCode', 'BirthRate']][4:20]

#mathematical operations and adding new column

stats["New"] = stats.BirthRate + stats.InternetUsers

#print
stats

#deleting new column
stats.drop('New', axis=1, inplace=True) #apart from the label, pass the other arguments by name

stats #inplace=True changes the original data frame.

#without inplace=True, drop() returns a new data frame, so assign it back (stats = stats.drop('New', axis=1)) to keep the change.


Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
0,Aruba,ABW,10.244,78.9,High income
1,Afghanistan,AFG,35.253,5.9,Low income
2,Angola,AGO,45.985,19.1,Upper middle income
3,Albania,ALB,12.877,57.2,Upper middle income
4,United Arab Emirates,ARE,11.044,88.0,High income
...,...,...,...,...,...
190,"Yemen, Rep.",YEM,32.947,20.0,Lower middle income
191,South Africa,ZAF,20.850,46.5,Upper middle income
192,"Congo, Dem. Rep.",COD,42.394,2.2,Low income
193,Zambia,ZMB,40.471,15.4,Lower middle income


In [20]:
stats.head()

Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
0,Aruba,ABW,10.244,78.9,High income
1,Afghanistan,AFG,35.253,5.9,Low income
2,Angola,AGO,45.985,19.1,Upper middle income
3,Albania,ALB,12.877,57.2,Upper middle income
4,United Arab Emirates,ARE,11.044,88.0,High income


In [22]:
stats.InternetUsers.head() < 2

0    False
1    False
2    False
3    False
4    False
Name: InternetUsers, dtype: bool

In [23]:
stats[stats.InternetUsers <2]

Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
11,Burundi,BDI,44.151,1.3,Low income
52,Eritrea,ERI,34.8,0.9,Low income
55,Ethiopia,ETH,32.925,1.9,Low income
64,Guinea,GIN,37.337,1.6,Low income
117,Myanmar,MMR,18.119,1.6,Lower middle income
127,Niger,NER,49.661,1.7,Low income
154,Sierra Leone,SLE,36.729,1.7,Low income
156,Somalia,SOM,43.891,1.5,Low income
172,Timor-Leste,TLS,35.755,1.1,Lower middle income


In [25]:
#note: the name `filter` shadows Python's built-in filter(); fine for a quick demo
filter = stats.InternetUsers < 2
filter


0      False
1      False
2      False
3      False
4      False
       ...  
190    False
191    False
192    False
193    False
194    False
Name: InternetUsers, Length: 195, dtype: bool

In [26]:
#How to use this filter

stats[filter]

Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
11,Burundi,BDI,44.151,1.3,Low income
52,Eritrea,ERI,34.8,0.9,Low income
55,Ethiopia,ETH,32.925,1.9,Low income
64,Guinea,GIN,37.337,1.6,Low income
117,Myanmar,MMR,18.119,1.6,Lower middle income
127,Niger,NER,49.661,1.7,Low income
154,Sierra Leone,SLE,36.729,1.7,Low income
156,Somalia,SOM,43.891,1.5,Low income
172,Timor-Leste,TLS,35.755,1.1,Lower middle income


In [31]:
#But before you do this 
#get the summary of this data frame

stats.describe()

#birth rate max value is 49.66

#What is birth rate?
#births per 1,000 people per year

Unnamed: 0,BirthRate,InternetUsers
count,195.0,195.0
mean,21.469928,42.076471
std,10.605467,29.030788
min,7.9,0.9
25%,12.1205,14.52
50%,19.68,41.0
75%,29.7595,66.225
max,49.661,96.5468


In [36]:
result = stats[stats.BirthRate > 41] #41 is the median of InternetUsers; the median of BirthRate is 19.68

len(result)

9

In [None]:
#having more than one filter
stats[stats.BirthRate > 40 and stats.InternetUsers < 2]  #Error

#why are we getting the error?
#`and` needs a single True/False from each side, but each side here is a whole Series of them.

#To Python, the truth value of a Series is ambiguous. Use & with parentheses instead.

In [44]:
stats.BirthRate > 40 & stats.InternetUsers < 2

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]

In [58]:
filter1 = stats.InternetUsers < 2
filter2 = stats.BirthRate > 40

filter1 & filter2

stats[(stats.InternetUsers < 2) | (stats.BirthRate > 40)]

Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
2,Angola,AGO,45.985,19.1,Upper middle income
11,Burundi,BDI,44.151,1.3,Low income
14,Burkina Faso,BFA,40.551,9.1,Low income
52,Eritrea,ERI,34.8,0.9,Low income
55,Ethiopia,ETH,32.925,1.9,Low income
64,Guinea,GIN,37.337,1.6,Low income
65,"Gambia, The",GMB,42.525,14.0,Low income
115,Mali,MLI,44.138,3.5,Low income
117,Myanmar,MMR,18.119,1.6,Lower middle income
127,Niger,NER,49.661,1.7,Low income


In [56]:
stats[stats.IncomeGroup == 'High income']

Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
0,Aruba,ABW,10.244,78.90,High income
4,United Arab Emirates,ARE,11.044,88.00,High income
5,Argentina,ARG,17.716,59.90,High income
7,Antigua and Barbuda,ATG,16.447,63.40,High income
8,Australia,AUS,13.200,83.00,High income
...,...,...,...,...,...
174,Trinidad and Tobago,TTO,14.590,63.80,High income
180,Uruguay,URY,14.374,57.69,High income
181,United States,USA,12.500,84.20,High income
184,"Venezuela, RB",VEN,19.842,54.90,High income


In [59]:
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   CountryName    195 non-null    object 
 1   CountryCode    195 non-null    object 
 2   BirthRate      195 non-null    float64
 3   InternetUsers  195 non-null    float64
 4   IncomeGroup    195 non-null    object 
dtypes: float64(2), object(3)
memory usage: 7.7+ KB


In [60]:
stats.IncomeGroup.unique()

array(['High income', 'Low income', 'Upper middle income',
       'Lower middle income'], dtype=object)

In [63]:
stats[stats.CountryName == 'Malta']

stats[stats.CountryCode == 'MLT']

Unnamed: 0,CountryName,CountryCode,BirthRate,InternetUsers,IncomeGroup
116,Malta,MLT,9.5,68.9138,High income
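
The equality filters above match one value at a time. For several values at once, `Series.isin` is one option; it is not used in the cells above, so treat this as an optional extra. A sketch on a made-up four-row sample (`sample`, `malta` and `selected` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({
    "CountryName": ["Aruba", "Afghanistan", "Angola", "Malta"],
    "CountryCode": ["ABW", "AFG", "AGO", "MLT"],
    "IncomeGroup": ["High income", "Low income",
                    "Upper middle income", "High income"],
})

# Equality on a text column is an exact, case-sensitive match
malta = sample[sample.CountryCode == "MLT"]

# isin() is True wherever the value appears in the given list
selected = sample[sample.CountryCode.isin(["ABW", "MLT"])]

print(malta.CountryName.tolist())      # ['Malta']
print(selected.CountryName.tolist())   # ['Aruba', 'Malta']
```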
