# Selecting Columns and Rows 

## Objectives:

Columns:
**_Selecting_** and **_Working with_** columns from a dataframe


Rows: 
**_Understanding the Difference_** between **_loc and iloc_** methods, and **_Selecting_** rows from a dataframe


### Key Concepts
Selecting specific parts of a dataset is an essential part of working with data. For example, filling in missing values in one or more columns requires selecting those columns first. You should therefore be comfortable with the syntax for selecting rows and columns.

| Command                       | Description                                           |
|-------------------------------|-------------------------------------------------------|
| df[col]                       | select one column as a Series                         |
| df[[col]]                     | select one column as a DataFrame                      |
| df[[col1, col2, ... ]]        | select 2+ columns as a DataFrame                      |
| df['column_name'] = new_values| assign new values to the column                        |
| df.drop()                     | drop specified rows or columns                        |
| df['column'].astype()         | cast a pandas column to a specified dtype             |
| df.loc[row]                   | select one row as a Series by index                   |
| df.loc[[row1, row2]]          | select 1+ rows as a DataFrame by index                |
| df.loc[[row], [col]]          | select rows and columns as a DataFrame by index       |
| df.iloc[a:b, c:d]             | select rows/columns by integer-location               |
| df.set_index()                | set selected column as index                           |


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/penguins_simple.csv', sep=";")

In [3]:
# look at the dataframe

df

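|     | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex    |
|-----|---------|--------------------|-------------------|---------------------|---------------|--------|
| 0   | Adelie  | 39.1               | 18.7              | 181.0               | 3750.0        | MALE   |
| 1   | Adelie  | 39.5               | 17.4              | 186.0               | 3800.0        | FEMALE |
| 2   | Adelie  | 40.3               | 18.0              | 195.0               | 3250.0        | FEMALE |
| 3   | Adelie  | 36.7               | 19.3              | 193.0               | 3450.0        | FEMALE |
| 4   | Adelie  | 39.3               | 20.6              | 190.0               | 3650.0        | MALE   |
| ... | ...     | ...                | ...               | ...                 | ...           | ...    |
| 328 | Gentoo  | 47.2               | 13.7              | 214.0               | 4925.0        | FEMALE |
| 329 | Gentoo  | 46.8               | 14.3              | 215.0               | 4850.0        | FEMALE |
| 330 | Gentoo  | 50.4               | 15.7              | 222.0               | 5750.0        | MALE   |
| 331 | Gentoo  | 45.2               | 14.8              | 212.0               | 5200.0        | FEMALE |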


## 1. Columns

### 1.1 Renaming the columns

In [None]:
df.columns = ['species', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_gg', 'sex']

In [None]:
df

### 1.2 Selecting a single column

In [None]:
# using single square brackets, we select a single column as a series
df['sex']

This will extract the column as a pd.Series.

In [None]:
type(df['sex'])  # as we saw in the pandas encounter, this is a pandas Series

In [None]:
# using double square brackets, we select a single column as a dataframe
df[['sex']]

In [None]:
type(df[['sex']])
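
To make the difference concrete, here is a minimal sketch on a small hand-made DataFrame. The `toy` data is invented for illustration and is not the penguins file.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "sex": ["MALE", "FEMALE"],
})

as_series = toy["sex"]     # single brackets: one-dimensional Series
as_frame = toy[["sex"]]    # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a one-element shape (only a number of rows), while the DataFrame has both rows and columns even though it holds a single column.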

### 1.3 Selecting Multiple Columns

In [None]:
# when selecting multiple columns, we HAVE to use double square brackets
# and we get a dataframe back
df[['species','culmen_length_mm','sex']]
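
A short sketch of why the inner brackets matter, on invented toy data: the inner pair is simply a Python list of column names, and the result follows the order of that list. Without the list, pandas looks for one column named by the whole tuple and raises a `KeyError`.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "body_mass_g": [3750.0, 5200.0],
    "sex": ["MALE", "FEMALE"],
})

wanted = ["sex", "species"]      # an ordinary Python list of column names
subset = toy[wanted]             # same as toy[["sex", "species"]]

try:
    toy["sex", "species"]        # no list: pandas looks up the tuple ('sex', 'species')
    missing_list_failed = False
except KeyError:
    missing_list_failed = True

print(list(subset.columns), missing_list_failed)
```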

### 1.4 Changing column types

In [None]:
df.dtypes  # each column of a dataframe can have its own data type, hence the plural "dtypes"

In [None]:
# we sometimes need to change the data type of a column; this is usually part of data cleaning
# note: casting floats to int cuts off the decimal part (39.1 becomes 39)

df['culmen_length_mm'] = df['culmen_length_mm'].astype(int)

In [None]:
# Let's check the datatypes again
df.dtypes
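
Casting floats to integers drops the decimal part rather than rounding. The sketch below uses a small invented Series to show the difference. Rounding first is one option if truncation is not what you want.

```python
import pandas as pd

lengths = pd.Series([39.1, 39.6, 40.9])   # invented values

truncated = lengths.astype(int)           # decimal part is cut off
rounded = lengths.round().astype(int)     # round to the nearest whole number, then cast

print(truncated.tolist())
print(rounded.tolist())
```

Also keep in mind that a float column containing missing values (`NaN`) cannot be cast to `int` this way; pandas raises an error instead.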

###  1.5 Creating columns

In [None]:
# Adding a new column
# For example, let's add a 'year_collected' column with a constant value
df['year_collected'] = 2024

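How to round a single number in Python:
`round(<number_to_round>, <number_of_decimal_places>)`

A pandas column has its own method for this, `.round(<number_of_decimal_places>)`, which rounds every value in the column.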
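
As a small illustration with invented values: the built-in `round` works on one number, while the `.round()` method of a pandas Series rounds every element at once.

```python
import pandas as pd

single = round(3.14159, 2)                 # built-in round on one number

masses_g = pd.Series([3750.0, 3456.0, 5212.0])   # invented body masses in grams
masses_kg = (masses_g / 1000).round(2)     # element-wise rounding of a Series

print(single)
print(masses_kg.tolist())
```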

In [None]:
df

In [None]:
# Convert the body mass from grams to kilograms, round to 2 decimal places, and store the result in a new column

df['body_mass_kg'] = (df['body_mass_gg'] / 1000.0).round(2)

In [None]:
# Now, df includes a new column 'body_mass_kg' with the body mass in kilograms, rounded to 2 decimal places
df

### 1.6 Dropping columns

In [None]:
# dropping a single column

df.drop('year_collected', axis='columns')
#df.drop('year_collected', axis=1)

In [None]:
# note that the dataframe did not change!
df

Uh-oh! Dropping returned a (changed) copy of the dataframe, but didn't change the original!

To make the changes stick, you can:
* assign the result to another dataframe
* use the `inplace=True` parameter

In [None]:
# assign the result to another dataframe
df_new = df.drop('year_collected', axis=1)  # axis=1 is equivalent to axis='columns'

In [None]:
df_new

In [None]:
# to drop the column from the original dataframe, we use the inplace=True parameter
df.drop('year_collected', axis='columns', inplace=True)

In [None]:
df
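
An equivalent spelling, sketched on invented toy data: `drop` also accepts a `columns=` keyword, which avoids having to remember the `axis` argument. Reassigning the result to the same name is a common alternative to `inplace=True`.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "year_collected": [2024, 2024],
    "sex": ["MALE", "FEMALE"],
})

dropped = toy.drop(columns=["year_collected"])   # returns a new DataFrame
still_there = "year_collected" in toy.columns    # the original is unchanged

toy = toy.drop(columns=["year_collected"])       # reassign to keep the change

print(still_there, list(toy.columns))
```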

## 2. Selecting rows (and columns): `.loc[]` and `.iloc[]` methods

A brief slicing recap:

In [None]:
a = [1,2,3,4,5,6]

In [None]:
a

Reminder: slicing syntax is `a[start:end:step]`
* If not specified, `start` is the beginning of the list.
* If not specified, `end` is the end of the list.
* If not specified, `step` is 1.
* Negative numbers count from the back, e.g. the second element from the back is `a[-2]`.

In [None]:
# select first 4 numbers
a[:4]

In [None]:
a[1:4]

In [None]:
# what does this do?
a[::3]
# it returns every third element, starting with the first: [1, 4]
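
A few more slices on the same list, as a quick sketch of negative indices and steps:

```python
a = [1, 2, 3, 4, 5, 6]

second_from_back = a[-2]   # 5
last_two = a[-2:]          # [5, 6]
every_second = a[::2]      # [1, 3, 5]
reversed_copy = a[::-1]    # [6, 5, 4, 3, 2, 1]

print(second_from_back, last_two, every_second, reversed_copy)
```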

### 2.1 Selecting based on a single value (select a single row/column)

In [None]:
df

## loc[ ]

### Here's a quick guide on how to use loc[]:

- Select a single row: `df.loc['index_label']`
- Select multiple rows: `df.loc[['label1', 'label2']]`
- Select rows by a range of labels: `df.loc['label1':'label3']`
- Conditional selection: `df.loc[df['column'] > value]`
- Select specific rows and columns: `df.loc[['row1', 'row2'], ['column1', 'column2']]`

Remember, `loc[]` operates on the DataFrame's index labels and column names. This makes it particularly suited to DataFrames whose labels are meaningful, or to selections based on the data's content rather than its position.
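
The patterns from the guide can be tried out on a small DataFrame with string index labels. This is a minimal sketch; the labels `p1` to `p4` and the values are invented for illustration.

```python
import pandas as pd

# invented toy data with string index labels
toy = pd.DataFrame(
    {"mass_g": [3750.0, 3800.0, 5200.0, 3700.0],
     "sex": ["MALE", "FEMALE", "FEMALE", "MALE"]},
    index=["p1", "p2", "p3", "p4"],
)

one_row = toy.loc["p1"]                       # a single row, returned as a Series
some_rows = toy.loc[["p1", "p3"]]             # DataFrame with the two listed rows
label_range = toy.loc["p1":"p3"]              # p1, p2 and p3 (end label included)
heavy = toy.loc[toy["mass_g"] > 3750]         # rows where the condition is True
cell_block = toy.loc[["p1", "p2"], ["sex"]]   # chosen rows and chosen columns

print(list(label_range.index), list(heavy.index))
```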

In [None]:
# Example 1: Select the first row
df.loc[0, :]

In [None]:
# Example 2: Select first two rows
df.loc[0:1, :]

In [None]:
df

In [None]:
# Example 3: Select all rows where the culmen length is greater than 40 mm
df.loc[df['culmen_length_mm'] > 40]

In [None]:
# Example 4: Select all rows where the species is "Adelie"

adelie_penguins = df.loc[df['species'] == 'Adelie']
adelie_penguins

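**_How does loc[ ] work?_**

- Notice that `df.loc[0:1, :]` returned both row 0 and row 1: unlike regular Python slicing, a label slice in `loc` includes its end point.
- `loc` uses the row labels, not the row positions. Let's change the index to see this more clearly.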
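
The inclusive end point is easiest to see next to `iloc`, which follows the usual Python slicing rules. A minimal sketch on invented values, using the default index 0, 1, 2:

```python
import pandas as pd

toy = pd.DataFrame({"mass_g": [3750.0, 3800.0, 3250.0]})  # default index 0, 1, 2

by_label = toy.loc[0:1]      # labels 0 and 1: end point included
by_position = toy.iloc[0:1]  # position 0 only: end point excluded

print(len(by_label), len(by_position))
```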

In [None]:
# make the 'species' column the index of the dataframe
df.set_index('species', inplace=True)

# It is also possible to set a column as the index while reading the data,
# e.g. with the index_col parameter of pd.read_csv
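
Setting the index while reading the data can be done with the `index_col` parameter of `pd.read_csv`. The sketch below reads from an in-memory string instead of a file so that it runs on its own; the three-column content is invented and only mimics a semicolon-separated file.

```python
import io
import pandas as pd

# a tiny in-memory stand-in for a semicolon-separated file (invented data)
csv_text = "species;body_mass_g;sex\nAdelie;3750.0;MALE\nGentoo;5200.0;FEMALE\n"

# index_col names the column that should become the index
indexed = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="species")

print(indexed.index.name, list(indexed.columns))
```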


In [None]:
df.head()

In [None]:
# Example 1: Select the second row

# this now raises a KeyError: the index labels are species names, so the label 1 no longer exists
df.loc[1, :]

In [None]:
df.loc['Adelie', :]  # the index labels are now the species names, and loc selects by label

In [None]:
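# Example 2: Select all rows from the first 'Adelie' label through the last 'Chinstrap' label
# (a label slice on a repeated index needs the rows to be ordered by label, as they appear to be here)
df.loc['Adelie':'Chinstrap', :]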
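
Label slices on an index with repeated labels depend on the order of the rows. The sketch below uses invented toy data in which the species labels are interleaved: the slice fails with a `KeyError` until the index is sorted. Sorting with `sort_index()` is one way to make such a slice work, not the only one.

```python
import pandas as pd

# invented toy data: repeated labels that are NOT grouped together
shuffled = pd.DataFrame(
    {"body_mass_g": [3750.0, 5200.0, 3800.0, 3700.0]},
    index=["Adelie", "Gentoo", "Adelie", "Chinstrap"],
)

try:
    shuffled.loc["Adelie":"Chinstrap"]   # ambiguous start: 'Adelie' occurs in two separate places
    slice_failed = False
except KeyError:
    slice_failed = True

ordered = shuffled.sort_index()                 # group equal labels together
subset = ordered.loc["Adelie":"Chinstrap"]      # now both end points are well defined

print(slice_failed, list(subset.index))
```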

## iloc[ ]


The `iloc[]` indexer in pandas is used for integer-location based indexing: it selects rows and columns by their integer positions, regardless of the index labels.


### Here's a quick guide on how to use iloc[]:

- Select a single row: `df.iloc[5]`
- Select multiple rows: `df.iloc[5:10]`
- Select rows and specific columns: `df.iloc[5:10, 0:2]`
- Select specific rows and all columns: `df.iloc[[1, 3, 7], :]`

In [None]:
df.iloc[5]  # selects the row at position 5 (the sixth row)

In [None]:
df.iloc[5:10]  # selects the rows at positions 5 to 9

In [None]:
df.iloc[5:10, 0:2]  # selects the rows at positions 5 to 9 and the columns at positions 0 and 1

In [None]:
df.iloc[[1, 3, 7], :]  # selects the rows at positions 1, 3 and 7, with all columns
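
The same four selections can be checked on a small invented DataFrame where the values make the selected positions easy to recognise. This is a sketch for experimenting, not part of the penguins data.

```python
import pandas as pd

# invented toy data: 12 rows, 3 columns; column 'a' equals the row position
toy = pd.DataFrame({
    "a": range(12),
    "b": range(100, 112),
    "c": list("abcdefghijkl"),
})

row_5 = toy.iloc[5]                 # Series for the row at position 5
rows_5_to_9 = toy.iloc[5:10]        # positions 5, 6, 7, 8, 9
block = toy.iloc[5:10, 0:2]         # same rows, columns at positions 0 and 1
picked = toy.iloc[[1, 3, 7], :]     # rows at positions 1, 3 and 7, all columns

print(rows_5_to_9.shape, block.shape, picked["a"].tolist())
```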

- **.loc** is **label-based**:
    - You specify `rows` and `columns` by their row and column **labels**.
    - It has the following syntax: `df.loc[row_label, column_label]`


- **.iloc** is **integer position-based**:
    - You specify `rows` and `columns` by their **integer positions** (counting from 0).
    - It has the following syntax: `df.iloc[row_position, column_position]`
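
To close, a minimal side-by-side sketch on invented toy data whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2), so that the two indexers cannot be confused:

```python
import pandas as pd

# invented toy data whose index labels differ from the row positions
toy = pd.DataFrame(
    {"mass_g": [3750.0, 3800.0, 5200.0]},
    index=[10, 20, 30],
)

by_label = toy.loc[10, "mass_g"]     # row labelled 10, column named 'mass_g'
by_position = toy.iloc[0, 0]         # first row, first column: the same cell

try:
    toy.loc[0]                       # there is no row labelled 0
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```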