<a href="https://colab.research.google.com/github/MevrouwHelderder/Assignments/blob/main/Selecting_data_in_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Importing (or ingesting) data into pandas.
# read_csv is the most commonly used option. Several parameters control how the file is parsed. The most important are:
# sep (or its alias delimiter): the character that separates the values.
# parse_dates: converts the given date and time columns into actual datetime values.
# thousands: the thousands separator used in large numbers.

# Other read_* functions cover other sources, for example:
# URLs, HTML tables on websites (for example directly from Wikipedia!), Excel files, the clipboard, etc.
# See the pandas documentation.
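
The three options above can be tried without a file on disk. This is a minimal sketch, assuming a made-up semicolon-separated sample (the column names `date`, `city` and `visitors` are hypothetical and not part of the smoking dataset):

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration only.
raw = io.StringIO(
    "date;city;visitors\n"
    "2021-01-01;Amsterdam;1,250\n"
    "2021-01-02;Utrecht;980\n"
)

# sep tells pandas the values are split on ";",
# thousands tells it that "," inside a number is a thousands separator,
# parse_dates converts the listed column into datetime values.
df = pd.read_csv(raw, sep=";", thousands=",", parse_dates=["date"])

print(df.dtypes)
```

Without `thousands=","` the `visitors` column would be read as text, because "1,250" is not a valid number on its own.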

In [None]:
import numpy as np
import pandas as pd


###You can select data with .loc or .iloc. 
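  * .loc selects by label and offers a bit more functionality; .iloc selects by integer position and can be somewhat faster, which may matter for larger datasets.
  * .loc slices up to AND INCLUDING the end label
  * .iloc slices up to BUT NOT INCLUDING the end position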

###You could also use the indexing operator but it is not advised.
  * It can create temporary data structures that you don't keep around, which hurts performance, especially when several selections are chained. It makes your code slower.
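
The difference in slicing between .loc and .iloc is easiest to see side by side. A small sketch, assuming a toy dataframe with the default 0, 1, 2, ... index (the column name `value` is made up):

```python
import pandas as pd

# Toy dataframe with the default integer index 0..3 (illustrative only).
df = pd.DataFrame({"value": [10, 20, 30, 40]})

# .loc slices by label and includes the end label: rows 0, 1 and 2.
by_label = df.loc[0:2]

# .iloc slices by position and excludes the end position: rows 0 and 1.
by_position = df.iloc[0:2]

print(len(by_label), len(by_position))
```

The same slice `0:2` therefore returns three rows with .loc and two rows with .iloc.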


In [None]:
# Load the dataset:
# use ; as the separator and , as the thousands separator because that is what the original file uses.
smoking = pd.read_csv("/content/drive/MyDrive/smoking-indicators.csv", sep=";", thousands=",")

# Show the dataframe. Optional: show only the first n rows instead of all rows by using .head(n)
smoking.head(3)


In [None]:
# Dropping some columns for easier use with .drop
# We specify what to drop and along which axis:
# axis=1 means we drop columns
# axis=0 means we drop rows

# You can drop by using the name of the column:
smoking.drop("la", inplace=True, axis=1)

# or by slicing the column index (columns= already implies axis=1):
smoking.drop(columns=smoking.columns[5:], inplace=True)

smoking.head(3)
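
As an alternative to `inplace=True`, .drop can also return a new dataframe and leave the original untouched. A minimal sketch on a made-up frame (the column names `a` to `d` are placeholders, not columns of the smoking file):

```python
import pandas as pd

# Toy dataframe, invented for illustration only.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Drop one column by name; the result is a new dataframe.
no_b = toy.drop(columns="b")

# Drop every column from position 2 onwards by slicing the column index.
first_two = toy.drop(columns=toy.columns[2:])

# Drop a row by its index label (the same as axis=0).
no_first_row = toy.drop(index=0)

print(list(no_b.columns), list(first_two.columns), list(no_first_row.index))
```

Because nothing is changed in place, `toy` still has all four columns afterwards, which makes it safe to re-run the cell.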




In [None]:
# Rename columns for easier use and show a few more rows:
smoking.columns = ["Borough", "Current smokers", "Ex smokers", "Never smoked", "Total"]
smoking.head(10)
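
Assigning a full list to `.columns` requires one name for every column, in the right order. When only a few columns need a new name, `.rename` with a mapping is an option. A sketch with hypothetical original names (`area`, `cur`, `ex` are made up; the real file's column names are not shown here):

```python
import pandas as pd

# Toy dataframe with placeholder column names (illustrative only).
toy = pd.DataFrame({"area": ["X", "Y"], "cur": [1, 2], "ex": [3, 4]})

# Map old names to new names; columns not in the mapping keep their name.
renamed = toy.rename(columns={"cur": "Current smokers", "ex": "Ex smokers"})

print(list(renamed.columns))
```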

In [None]:
# Selecting single rows
# .loc takes an index label inside square brackets []
smoking.loc[0]
# The row is returned as a Series, so the column names become the index. This gives a neat summary of this one row

In [None]:
# Selecting multiple rows can be done in a few ways.
# You can pass a list of index labels.
# Note the double brackets: one set of brackets belongs to .loc and the other to the list:
smoking.loc[[0,1,2]]

In [None]:
# You can slice. Note that .loc includes the end label and .iloc does not, so this returns rows 0, 1 and 2.
smoking.loc[0:2]

In [None]:
# You can also select rows and columns. You separate them with a comma.
# Rows first, then columns.
smoking.loc[0:5, "Current smokers"]

In [None]:
# Note: when slicing is not an option because the rows or columns you need are not adjacent, pass a list instead. Mind the brackets!
smoking.loc[0:5, ["Current smokers","Total"]]

#With name as index
#Label-based selection only works with .loc, not with .iloc

In [None]:
# You can specify if you want something used as the index. For example, in this case, the name of the Borough.
# Make a new variable:
smoking_name = smoking.set_index("Borough")
smoking_name.head(5)

In [96]:
# Now you can use this index for selecting, the same way as before:
# Note that the column we used as the index is now no longer part of the dataframe in the same way the other columns are.
smoking_name.loc[["Brent", "Enfield"]]

Borough,Current smokers,Ex smokers,Never smoked,Total
Brent,40827,45780,111547,198154
Enfield,42603,57726,118740,219069
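
Once a column is the index, .loc uses its labels while .iloc keeps using positions. A small sketch on a toy frame that reuses a few values from the outputs in this notebook (the frame itself is an illustration, not the full dataset):

```python
import pandas as pd

# Toy frame built from a few rows shown in this notebook's outputs.
toy = pd.DataFrame({
    "Borough": ["Barnet", "Brent", "Enfield"],
    "Current smokers": [49414, 40827, 42603],
    "Ex smokers": [76507, 45780, 57726],
})
by_name = toy.set_index("Borough")

# Label-based: row "Brent", column "Ex smokers".
brent_ex = by_name.loc["Brent", "Ex smokers"]

# Position-based: second row, second column (the index is not counted as a column).
same_cell = by_name.iloc[1, 1]

# reset_index turns the index back into an ordinary column.
restored = by_name.reset_index()

print(brent_ex, same_cell, list(restored.columns))
```

Both selections return the same cell; only the way of addressing it differs.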


###Example of the indexing operator. Not advised!!
It is less clear: you can't easily see in your code what exactly is selected.

In [97]:
# Selecting rows (a slice here works by position, so the end is excluded):
smoking[0:2]

,Borough,Current smokers,Ex smokers,Never smoked,Total
0,Barking and Dagenham,28722,31666,67056,127444
1,Barnet,49414,76507,153960,279881


In [100]:
# Selecting a single column:
smoking["Ex smokers"].head(10)

0    31666
1    76507
2    61118
3    45780
4    87694
5    67276
6    79044
7    62605
8    57726
9    50832
Name: Ex smokers, dtype: int64

In [110]:
# Chained indexing = BAD!!
# It creates an intermediate dataframe and then selects from that. You needlessly create more data and more clutter.

# In this example we first make a new dataframe named new_df which consists of only the first two rows of the original dataframe
new_df = smoking[0:2]
new_df

,Borough,Current smokers,Ex smokers,Never smoked,Total
0,Barking and Dagenham,28722,31666,67056,127444
1,Barnet,49414,76507,153960,279881


In [111]:
# We then select from that new dataframe the columns we want to see.
new_df[["Borough", "Ex smokers"]]

,Borough,Ex smokers
0,Barking and Dagenham,31666
1,Barnet,76507


In [112]:
# You can write that in one line, but it still creates an intermediate dataframe:
smoking[0:2][["Borough", "Ex smokers"]]

,Borough,Ex smokers
0,Barking and Dagenham,31666
1,Barnet,76507
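
The chained selection above can be written as a single .loc call that names rows and columns at once. A minimal sketch on a toy frame that reuses a few values from the outputs in this notebook (the three-row frame is an illustration only):

```python
import pandas as pd

# Toy frame built from a few rows shown in this notebook's outputs.
toy = pd.DataFrame({
    "Borough": ["Barking and Dagenham", "Barnet", "Brent"],
    "Ex smokers": [31666, 76507, 45780],
    "Total": [127444, 279881, 198154],
})

# Chained indexing: first a row slice, then a column selection on the result.
chained = toy[0:2][["Borough", "Ex smokers"]]

# One .loc call: rows 0 up to and including 1, and the two columns.
single = toy.loc[0:1, ["Borough", "Ex smokers"]]

print(chained.equals(single))

# Assigning through one .loc call changes the dataframe it is called on.
updated = toy.copy()
updated.loc[0:1, "Ex smokers"] = 0
print(updated["Ex smokers"].tolist())
```

Note that the end of the slice changes from 2 to 1, because .loc includes the end label. Assigning through a chained selection may instead modify a temporary copy, which is another reason to prefer the single call.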
