# ***Indexing, Selecting & Assigning***

### First things first - import the library and read the CSV data

In [None]:
# import pandas
import pandas as pd

# read data - here we are using a public dataset of avocado prices
# original dataset link - http://www.hassavocadoboard.com/retail/volume-and-price-data

data = pd.read_csv('archive/avocado.csv')

# drop the first column of the file (selected by position), modifying data in place
data.drop(data.columns[[0]], axis=1, inplace=True)

print(f'Shape of the dataframe: {data.shape}')

In [None]:
# only the first 2 rows are shown here to keep the output short
data.head(2)

In [None]:
# details of the dataset - column names, the datatype of each column, and the number of non-null values per column
data.info()

In [None]:
# a quick look at the count, mean, standard deviation, min value, quartiles and max value of each numeric column
data.describe()

# Accessing columns by their name

There are two ways to access a column by its name - 
1. The dot ( **.** ) operator
2. The indexing ( [ ] ) operator

For example, to access the region column, use either of the syntaxes below.

In [None]:
## 1. the dot operator
regions = data.region

# OR

## 2. the indexing operator
regions = data['region']

regions

We also have a column named 'Total Bags'. Because its name contains a space, this column cannot be accessed using the dot operator.\
Here we have to use the indexing operator, e.g. - data.Total Bags ❌ || data['Total Bags'] ✅
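A minimal sketch of the difference, using a small made-up frame (`toy` and its values are illustrative, not the real avocado data). The dot form fails before pandas is even involved, because `toy.Total Bags` is not valid Python syntax; `compile` is used here only to demonstrate that without crashing the cell.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring two column names from the dataset
toy = pd.DataFrame({
    "region": ["Albany", "Boston"],
    "Total Bags": [8696.87, 9505.56],
})

# A column name that is a valid Python identifier works with both styles
same = toy.region.equals(toy["region"])

# A column name containing a space is only reachable through the indexing operator
total_bags = toy["Total Bags"]

# The dot form is rejected by the Python parser itself
try:
    compile("toy.Total Bags", "<example>", "eval")
    dot_is_valid_syntax = True
except SyntaxError:
    dot_is_valid_syntax = False

print(same, dot_is_valid_syntax)
```

Since the indexing operator works for every column name, it is the safer default when column names come from an external file.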

# Indexing in Pandas

Pandas has two accessor operators, ***loc*** and ***iloc***. They are a bit more advanced, but you'll become familiar with them soon.

## Index based selection - iloc

Pandas indexing works in two paradigms - 1. index based (positional), 2. label based. The first is index-based selection: selecting data based on its numerical position in the data. ***iloc*** follows this paradigm.

### Selecting rows:

In [None]:
# Syntax:
# any_variable_name = data.iloc[index]

# To select the first row of data in a DataFrame:
first_row = data.iloc[0]    # --> returns the first row as a Series, with the column names as its index

# To select the second row of data in a DataFrame:
second_row = data.iloc[1]

print(first_row, second_row, sep='\n-----------------------------\n')

### Selecting Columns:

In [None]:
# Syntax:
# any_variable_name = data.iloc[: , index]
# ref - Python slicing method(https://towardsdatascience.com/the-basics-of-indexing-and-slicing-python-lists-2d12c90a94cf)

# To select the first column of data in a DataFrame:
first_column = data.iloc[: , 0]
# print(first_column)

# To select the second column of data in a DataFrame:
second_column = data.iloc[: , 1]
# print(second_column)

# Selecting first 5 rows of second column
first_five_from_sec_column = data.iloc[:5 , 1]
# print(first_five_from_sec_column)

# Selecting some specific rows from the second column:
data.iloc[3:7, 1]   # rows can be given either as a slice or as a list of positions

# let's say we want the rows at positions 1, 3, 6, 10, 50 from the first column
data.iloc[[1, 3, 6, 10, 50], 0]

*** NOTE: Negative indices (e.g.: -1, -2, -3, -4 etc.) also work with iloc; they count from the end.
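A short sketch of negative positions with iloc, on a made-up three-row frame (the name `toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
})

last_row = toy.iloc[-1]         # last row, returned as a Series
last_column = toy.iloc[:, -1]   # last column (AveragePrice)
last_two_rows = toy.iloc[-2:]   # last two rows, returned as a DataFrame

print(last_row["AveragePrice"])
print(last_column.name)
print(len(last_two_rows))
```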

## Label based selection

The second paradigm is label-based selection, which is the one followed by the ***loc*** operator. In this paradigm, it's the data index value (the label), not its position, which matters.

In [None]:
# For example, to get three columns of the row labelled 0:
data.loc[0, ['region', 'Total Volume', 'AveragePrice']]

In [None]:
# To get specific rows of specific columns
data.loc[[1,13,25,16,47,210,69,380], ['region', 'Total Volume', 'AveragePrice']]

*** NOTE: iloc uses the standard Python slicing scheme, where the first element of the range is included and the last one is excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, slices inclusively. So 0:10 will select entries 0,...,10.
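The difference is easiest to see side by side. The sketch below uses two small made-up frames (`toy` and `lettered` are illustrative names, and the prices are hypothetical): one with the default integer index and one with string labels.

```python
import pandas as pd

# Toy frame with the default 0..4 integer index (hypothetical values)
toy = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28]})

by_position = toy.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = toy.loc[0:3]       # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))   # 3 4

# The inclusive rule is what makes label slices natural for non-numeric labels
lettered = pd.DataFrame(
    {"AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28]},
    index=["a", "b", "c", "d", "e"],
)
b_to_d = lettered.loc["b":"d"]   # rows b, c and d
```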

# Manipulating the index

In a pandas DataFrame the index is not fixed. We can easily replace it by using the method ***.set_index()***, which returns a new DataFrame indexed by the chosen column.

In [None]:
# all columns in our dataframe
print(data.columns)

In [None]:
# let's set the index to the column 'Date':
data.set_index('Date')
# the returned DataFrame is indexed by the values of the Date column instead of the 0,1,2... numeric series.
# note: data itself is unchanged unless the result is assigned back (data = data.set_index('Date')).

# This is useful if you can come up with an index for the dataset which is better than the current one.
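
A small sketch of that behaviour on a made-up frame (`toy`, `indexed` and `restored` are illustrative names, and the values are hypothetical). It also shows `reset_index()`, the standard pandas method for moving the index back into an ordinary column.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20"],
    "AveragePrice": [1.33, 1.35],
})

indexed = toy.set_index("Date")   # a new DataFrame; toy keeps its 0, 1 index

print(list(toy.index))
print(list(indexed.index))

# With dates as the index, loc looks rows up by date label
price = indexed.loc["2015-12-20", "AveragePrice"]

# reset_index() turns the index back into a regular column
restored = indexed.reset_index()
```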

# Conditional Selection:

Let's put some conditions on our data. Why? Because to do interesting and meaningful things with data, we have to ask questions of it and filter it by conditions. For example - 

In [None]:
# this returns a Series of booleans: True where the region matches and False otherwise.
data.region == 'Albany'

### Now show all data from the region 'Albany' 

In [None]:
# remember the loc accessor? we'll be using it here
data.loc[data.region == 'Albany']

In [None]:
# more than 1 condition - and operation
data.loc[(data.region == 'Albany') & (data['AveragePrice'] > 1.4)]


In [None]:
# more than 1 condition - or operation
data.loc[(data.region == 'Albany') | (data['AveragePrice'] <= 1.2)]
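
The parentheses around each condition are required: `&` and `|` bind more tightly than comparison operators such as `==` and `>`, so without them Python would group the expression differently. One way to keep things readable is to name each condition first, as in this sketch on a made-up frame (`toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston", "Boston"],
    "AveragePrice": [1.33, 1.52, 1.10, 1.45],
})

# Each condition is a boolean Series that can be stored and reused
is_albany = toy["region"] == "Albany"
is_pricey = toy["AveragePrice"] > 1.4

both = toy.loc[is_albany & is_pricey]          # and: row 1
either = toy.loc[is_albany | is_pricey]        # or: rows 0, 1, 3
neither = toy.loc[~(is_albany | is_pricey)]    # not: row 2
```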

Pandas also provides **isnull** (and its complement **notnull**). These methods let you flag values which are (or are not) empty (NaN). The ***.isnull()*** method returns a boolean Series that is True wherever the column value is **null**; passing it to ***loc*** selects those records. Its complement, ***.notnull()***, helps you select the records having **non-null** values in the specified column.

In [None]:
# our data has already been cleaned, so there are no null values - this line returns an empty table
data.loc[data.AveragePrice.isnull()]

In [None]:
data.loc[data.AveragePrice.notnull()] # returns the whole dataset
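
Because the avocado data has no missing values, the two cells above cannot show the methods doing any real filtering. The sketch below uses a made-up frame with one deliberately missing price (`toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy frame with hypothetical values and one missing price
toy = pd.DataFrame({
    "region": ["Albany", "Boston", "Chicago"],
    "AveragePrice": [1.33, float("nan"), 1.10],
})

mask = toy["AveragePrice"].isnull()                # False, True, False
missing = toy.loc[mask]                            # only the Boston row
present = toy.loc[toy["AveragePrice"].notnull()]   # Albany and Chicago rows
n_missing = int(mask.sum())                        # True counts as 1

print(n_missing)   # 1
```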

# Assigning Data

### We can assign a new value to a column. For example, to set every value of the 'Total Bags' column to 0, use "data['Total Bags'] = 0".

In [None]:
# Before setting value
data['Total Bags']

In [None]:
data['Total Bags'] = 0

data['Total Bags']
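
Note that this assignment overwrites the original values in `data`. The sketch below, on a made-up frame (`toy`, `scratch` and `partial` are illustrative names with hypothetical values), shows one way to experiment on a copy, and how the same assignment can be limited to selected rows by combining it with the loc accessor from the earlier sections.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "region": ["Albany", "Boston", "Albany"],
    "Total Bags": [8696.87, 9505.56, 8145.35],
})

# Work on a copy so the original values survive
scratch = toy.copy()
scratch["Total Bags"] = 0    # the constant is broadcast to every row

# Overwrite only the rows that match a condition
partial = toy.copy()
partial.loc[partial["region"] == "Albany", "Total Bags"] = 0

print(scratch["Total Bags"].tolist())
print(partial["Total Bags"].tolist())
```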