## Segment 1 - Filtering and selecting data

In [2]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

### Selecting and retrieving data
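You can refer to an element by its index in two ways:
- by label index (the index value assigned to the element), or
- by integer index (the element's position, counting from 0)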
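
As a quick illustration of the two forms, here is a minimal sketch on a small made-up Series (the name `temps` and its values are hypothetical, not part of this notebook). It uses `.loc` for the label and `.iloc` for the position.

```python
import pandas as pd

# Hypothetical example data: three values with string labels
temps = pd.Series([21.5, 19.0, 23.4], index=["mon", "tue", "wed"])

by_label = temps.loc["tue"]    # label-based lookup
by_position = temps.iloc[1]    # position-based lookup, counting from 0

print(by_label, by_position)
```

Both lookups reach the same element here, because "tue" is the label of the element at position 1.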

In [3]:
s1 = Series(np.arange(8))
s1

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
dtype: int64

In [11]:
s1[6]

np.int64(6)

In [7]:
# series_obj = Series(np.arange(8),index=['row1', 'row2', 'row3', 'row4', 'row5', 'row6','row7', 'row8'])
# series_obj
series_obj = pd.Series(np.arange(8),index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6','row 7', 'row 8'])
series_obj

row 1    0
row 2    1
row 3    2
row 4    3
row 5    4
row 6    5
row 7    6
row 8    7
dtype: int64

In [16]:
series_obj.loc['row 6']

np.int64(5)

In [13]:
# select several elements at once by passing a list of labels
series_obj[['row 1','row 4']]

row 1    0
row 4    3
dtype: int64

In [14]:
np.random.seed(25)
DF_obj = pd.DataFrame(np.random.rand(36).reshape((6,6)), index=['row 1', 'row 2', 'row 3', 'row 4','row 5','row 6'], columns=['column 1','column 2','column 3','column 4','column 5','column 6'])
DF_obj

       column 1  column 2  column 3  column 4  column 5  column 6
row 1  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
row 2  0.684969  0.437611  0.556229  0.367080  0.402366  0.113041
row 3  0.447031  0.585445  0.161985  0.520719  0.326051  0.699186
row 4  0.366395  0.836375  0.481343  0.516502  0.383048  0.997541
row 5  0.514244  0.559053  0.034450  0.719930  0.421004  0.436935
row 6  0.281701  0.900274  0.669612  0.456069  0.289804  0.525819


In [17]:
DF_obj.loc[['row 2', 'row 5'], ['column 5', 'column 2']]

       column 5  column 2
row 2  0.402366  0.437611
row 5  0.421004  0.559053


### Data slicing
You can use slicing to select and return a range of values from a data set. Slicing works on index values, so it uses the same square brackets as selecting a single element.

The difference is that with slicing you pass in two index values separated by a colon. The value on the left side of the colon is the first record you want to select, and the value on the right side marks where the selection ends. With label-based slicing (`.loc`), the record at the right-hand label is included in the result. With position-based slicing (`.iloc`), the right-hand position is excluded, as in ordinary Python slicing. That is why `.iloc[1:5]` returns the same four records as `.loc['row 2' : 'row 5']`.

In [18]:
series_obj.loc['row 2' : 'row 5']

row 2    1
row 3    2
row 4    3
row 5    4
dtype: int64

In [21]:
series_obj.iloc[1:5]

row 2    1
row 3    2
row 4    3
row 5    4
dtype: int64
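
To make the endpoint rule explicit, here is a minimal sketch that rebuilds the same kind of Series under a hypothetical name (`demo`) and compares the two slices directly.

```python
import numpy as np
import pandas as pd

# Same shape of data as series_obj, rebuilt so the block is self-contained
labels = [f"row {i}" for i in range(1, 9)]
demo = pd.Series(np.arange(8), index=labels)

label_slice = demo.loc["row 2":"row 5"]   # the end label 'row 5' is included
position_slice = demo.iloc[1:5]           # positions 1, 2, 3, 4; position 5 is excluded

print(label_slice.equals(position_slice))
```

If you wrote `demo.iloc[1:4]` expecting it to match the label slice, you would get one record fewer.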

### Comparing with scalars
Now we're going to talk about comparison operators and scalar values. In case you don't know what a scalar value is, it's just a single value, such as one number. You can use comparison operators like greater than (>) or less than (<) to compare every element with a scalar. The result has the same shape as the original object and holds True or False for each element.

In [22]:
DF_obj < 0.5

       column 1  column 2  column 3  column 4  column 5  column 6
row 1     False     False      True      True      True      True
row 2     False      True     False      True      True      True
row 3      True     False      True     False      True     False
row 4      True     False      True     False      True     False
row 5     False     False      True     False      True      True
row 6      True     False     False      True      True     False
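
A True/False result like this can itself be summarized. The sketch below uses a small made-up DataFrame (`scores` is a hypothetical name, with hypothetical values) to count how many elements pass the comparison in each column and to flag the rows where at least one element passes.

```python
import pandas as pd

# Hypothetical example data
scores = pd.DataFrame({"a": [0.2, 0.7, 0.9], "b": [0.6, 0.1, 0.4]})

mask = scores < 0.5               # element-wise comparison with a scalar
count_per_column = mask.sum()     # True counts as 1, False as 0
any_below = mask.any(axis=1)      # True where a row has at least one value below 0.5

print(count_per_column)
print(any_below)
```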


### Filtering with scalars

In [24]:
series_obj[series_obj > 4]

row 6    5
row 7    6
row 8    7
dtype: int64
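
Filters can also combine several comparisons. This is a minimal sketch on a rebuilt copy of the Series (the name `values` is hypothetical). Note that pandas uses `&`, `|` and `~` for element-wise and, or and not, and each comparison needs its own parentheses.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(8), index=[f"row {i}" for i in range(1, 9)])

middle = values[(values > 2) & (values < 6)]   # both conditions must hold
outer = values[(values < 2) | (values > 5)]    # either condition may hold
not_small = values[~(values < 3)]              # negation of a condition

print(middle)
```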

### Setting values with scalars

In [25]:
series_obj['row 1'] = 8

series_obj

row 1    8
row 2    1
row 3    2
row 4    3
row 5    4
row 6    5
row 7    6
row 8    7
dtype: int64
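
The same assignment syntax works with a filter, so you can set every element that matches a condition in one step. A minimal sketch, using a rebuilt copy of the Series under the hypothetical name `capped`:

```python
import numpy as np
import pandas as pd

capped = pd.Series(np.arange(8), index=[f"row {i}" for i in range(1, 9)])

capped.loc["row 1"] = 8      # set a single element by label
capped[capped > 5] = 5       # set every element greater than 5 to 5

print(capped)
```

After both assignments, 'row 1', 'row 7' and 'row 8' all hold 5, because each of them was greater than 5 when the filter ran.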

Filtering and selecting using Pandas is one of the most fundamental things you'll do in data analysis. Make sure you know how to use indexing to select and retrieve records.

Class Exercise:
1) Create DF1, an 8x8 DataFrame with values varying from 1 to 10.
2) Replace the values greater than 7 with 10, and display the result at the end.

In [28]:
# np.random.seed(25)
# data = np.random.randint(1,10,size=64)
# df1 = DataFrame(data.reshape((8,8)))
# df1
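np.random.seed(25)
# Note: randint excludes the upper bound, so randint(1, 10) draws integers from 1 to 9.
# Use randint(1, 11) if the value 10 should also be possible.
data = np.random.randint(1,10,size=64).reshape(8,8)
DF1 = pd.DataFrame(data, index=range(8),columns=range(8))
DF1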

   0  1  2  3  4  5  6  7
0  5  7  8  3  9  5  5  6
1  2  8  4  9  8  4  5  4
2  2  7  1  1  3  6  5  1
3  2  7  2  3  5  3  6  9
4  1  7  5  3  6  7  1  3
5  5  4  2  4  6  8  3  4
6  8  8  9  4  6  6  6  4
7  3  9  4  1  6  9  5  4


In [29]:
DF1 > 7

       0      1      2      3      4      5      6      7
0  False  False   True  False   True  False  False  False
1  False   True  False   True   True  False  False  False
2  False  False  False  False  False  False  False  False
3  False  False  False  False  False  False  False   True
4  False  False  False  False  False  False  False  False
5  False  False  False  False  False   True  False  False
6   True   True   True  False  False  False  False  False
7  False   True  False  False  False   True  False  False


In [31]:
DF1[DF1 > 7] = 10
DF1

    0   1   2   3   4   5   6   7
0   5   7  10   3  10   5   5   6
1   2  10   4  10  10   4   5   4
2   2   7   1   1   3   6   5   1
3   2   7   2   3   5   3   6  10
4   1   7   5   3   6   7   1   3
5   5   4   2   4   6  10   3   4
6  10  10  10   4   6   6   6   4
7   3  10   4   1   6  10   5   4
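
The assignment above changes DF1 in place. If you want to keep the original values, one alternative (not the approach used in this notebook) is `DataFrame.mask`, which returns a new DataFrame. A minimal sketch on a small made-up grid, where `grid` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 example data
grid = pd.DataFrame(np.array([[5, 7, 8], [2, 8, 4], [9, 1, 3]]))

# mask() replaces values where the condition is True and leaves grid untouched
replaced = grid.mask(grid > 7, 10)

print(replaced)
print(grid)
```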


A Series can also be instantiated from a dict. The keys become the index labels.

In [33]:
d = {"b": 1, "a": 0, "c": 2}
d

{'b': 1, 'a': 0, 'c': 2}

In [34]:
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [35]:
pd.Series(d,index=['b','c','a','d'])

b    1.0
c    2.0
a    0.0
d    NaN
dtype: float64
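
In the last output, 'd' has no entry in the dict, so pandas fills it with NaN, and the dtype changes from int64 to float64 because NaN is a floating-point value. A minimal sketch of how you might detect and fill such gaps (the names `from_dict`, `missing` and `filled` are hypothetical, and filling with 0 is just one possible choice):

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}
from_dict = pd.Series(d, index=["b", "c", "a", "d"])

missing = from_dict.isna()                       # True where no value was found
filled = from_dict.fillna(0).astype("int64")     # replace NaN, then restore an integer dtype

print(missing)
print(filled)
```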

In [38]:
#name attribute
s = pd.Series(np.random.randint(1,10,size=10),name='My_series')

In [39]:
s

0    6
1    3
2    8
3    2
4    7
5    8
6    9
7    8
8    4
9    3
Name: My_series, dtype: int32

### Class exercise 1:
1. Create a Pandas Series of 5 fruit names:
   ["Apple", "Banana", "Mango", "Grapes", "Orange"]

2. Access the following elements:
   - The first element.
   - The last element.
   - The element at index 2.

3. Slice the Series to get:
   - The first 3 elements.
   - All elements except the first one.

4. Check if "Mango" is present in the Series.

In [24]:
Series_fruits = pd.Series(['Apple','Banana','Mango','Grapes','Orange'])

In [25]:
# First element
Series_fruits[0]

'Apple'

In [26]:
# Last element
Series_fruits.iloc[-1]

'Orange'

In [27]:
# Element at index 2
Series_fruits[2]

'Mango'

In [29]:
# First 3 elements
slice_first3 = Series_fruits[:3]
slice_first3

0     Apple
1    Banana
2     Mango
dtype: object

In [31]:
# All elements except the first one
slice_except_first = Series_fruits[1:]
slice_except_first

1    Banana
2     Mango
3    Grapes
4    Orange
dtype: object

In [32]:
# Check if "Mango" is present
is_mango_present = 'Mango' in Series_fruits.values
is_mango_present

True
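
The `.values` in that check matters: the `in` operator applied to a Series itself looks at the index labels, not at the stored values. A minimal sketch that shows the difference on a rebuilt copy of the Series (the name `fruits` is hypothetical):

```python
import pandas as pd

fruits = pd.Series(["Apple", "Banana", "Mango", "Grapes", "Orange"])

in_index = "Mango" in fruits             # checks the index labels 0..4
in_values = "Mango" in fruits.values     # checks the stored values
label_check = 2 in fruits                # 2 is one of the index labels
via_isin = fruits.isin(["Mango"]).any()  # another way to test the values

print(in_index, in_values, label_check, via_isin)
```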

### Class exercise 2:
1. Create a DataFrame using the following dictionary:

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

2. Access elements and rows:
   - Get the first row of the DataFrame.
   - Access the 'Age' column.
   - Get the age of the person at index 3.

3. Slice the DataFrame:
   - Get the first 3 rows.
   - Get the 'Name' and 'City' columns for the last 2 rows.

4. Filter rows:
   - Get all rows where Age > 25.

In [36]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df_people = pd.DataFrame(data)
df_people

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eve   29      Phoenix


In [35]:
# get the first row
df_people.iloc[0]

Name       Alice
Age           24
City    New York
Name: 0, dtype: object
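
The exercise also asks for the 'Age' column on its own, which the cells here do not show. A minimal sketch of that step, rebuilding the same DataFrame under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, 32, 29],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

# Selecting one column by name returns a Series
ages = people["Age"]

print(ages)
```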

In [37]:
# get the age of the person at index 3
df_people['Age'].iloc[3]

np.int64(32)

In [38]:
# get the first 3 rows
df_people.iloc[:3]

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


In [39]:
# get the 'Name' and 'City' columns for the last 2 rows
df_people[['Name','City']].iloc[-2:]

    Name     City
3  David  Houston
4    Eve  Phoenix
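
Rows and columns can also be selected in a single `.loc` call instead of chaining two selections. This is a sketch of one possible alternative, with the DataFrame rebuilt under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, 32, 29],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

# people.index[-2:] gives the labels of the last two rows, which .loc can use
last_two = people.loc[people.index[-2:], ["Name", "City"]]

# The chained form for comparison
chained = people[["Name", "City"]].iloc[-2:]

print(last_two.equals(chained))
```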


In [40]:
# get all rows where Age > 25
age_filter = df_people['Age'] > 25
df_people[age_filter]

    Name  Age         City
1    Bob   27  Los Angeles
3  David   32      Houston
4    Eve   29      Phoenix
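
A filter can be combined with a column selection, and pandas also offers `DataFrame.query` for writing the condition as a string. A minimal sketch of both, with the DataFrame rebuilt under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, 32, 29],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

# Filter the rows and keep only two columns in one .loc call
older_names = people.loc[people["Age"] > 25, ["Name", "Age"]]

# The same row filter written as a query string
older_query = people.query("Age > 25")

print(older_names)
```

Which form to use is mostly a matter of taste. The boolean filter is the one this notebook has used throughout.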
