# <b>CaRM Module: Advanced Topics in Data Preparation Using Python (2024/2025)</b>
## <b>Session 2: Selecting and re-encoding your data. </b>

### <b>Selection</b>

Each row and column of a dataframe has a label (or name) and a position (an integer). To select elements from a dataframe, you can use their "coordinates": either their row and column positions or their row and column labels. Note that row labels can themselves be integers that differ from the row positions (see the example below).

#### <b>2.1. Selecting single elements in a DataFrame</b>

In [None]:
import pandas as pd

myDict = {
    'fruits': ['Apple', 'Orange','Pear', 'Melon'],
    'animals': ['Dog', 'Cat', 'Cow', 'Bird']
}

myDf = pd.DataFrame(myDict, index=[2,3,4,5]) # here we assign labels to the rows using the index argument
print('\nWith index=[2,3,4,5]:')
print(myDf)

# without the index argument above, the row labels start at 0 and coincide with the positions
print('\nWithout assigning row labels with the index argument:')
myDfDefault = pd.DataFrame(myDict) # stored under a new name so that myDf keeps its row labels 2 to 5
print(myDfDefault)

# now select an element (e.g., 'Apple') using .iat and positions
print('\nSelection based on positions:')
print(myDf.iat[0,0])

# now select an element (e.g., 'Apple') using .at and labels
print('\nSelection based on labels:')
print(myDf.at[2,'fruits']) # row label 2 is the first row of myDf
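
The difference between positions and labels becomes visible when a label does not exist. The following minimal sketch uses an invented dataframe (`toy` is a hypothetical name, not part of the mission data): position 0 always exists, but label 0 does not, so `.at` raises a `KeyError`.

```python
import pandas as pd

# Hypothetical toy data, mirroring the cell above
toy = pd.DataFrame(
    {'fruits': ['Apple', 'Orange', 'Pear', 'Melon'],
     'animals': ['Dog', 'Cat', 'Cow', 'Bird']},
    index=[2, 3, 4, 5],
)

by_position = toy.iat[0, 0]      # first row, first column
by_label = toy.at[2, 'fruits']   # row labelled 2, column 'fruits'
print(by_position, by_label)

# Label 0 is not in the index [2, 3, 4, 5], so .at raises a KeyError
try:
    toy.at[0, 'fruits']
    label_zero_found = True
except KeyError:
    label_zero_found = False
print('Label 0 found?', label_zero_found)
```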

#### <b>2.2. Selecting columns and rows.</b>

The most explicit and preferred way for indexing (i.e., accessing information with indexes) from dataframes and series is using the <b>.iloc</b> and <b>.loc</b> properties. Both .iloc and .loc are <b>properties</b> of DataFrames and Series (not functions, accessed like attributes, used with <b>[]</b>). Use .iloc with <b>positions</b> and .loc with <b>labels</b>.


<b>Example of usage of .iloc and .loc</b>

In [None]:
import pandas as pd

# create a dictionary
myDict = {
    'fruits': ['Apple', 'Orange', 'Pear', 'Melon'], 
    'animals': ['Dog', 'Cat', 'Cow' , 'Bird'], 
    }

# convert the dictionary to a dataframe
myDf = pd.DataFrame(myDict, index=[2,3,4,5]) # set integer labels for your rows, different from the positions
print(myDf)

print('\nUse iloc for accessing the first row position (0):')
print('\nThe output is a Series:')
print(myDf.iloc[0]) # as Series or 
#print(type(myDf.iloc[0]))
#print('\nIs a Series?')
#print(isinstance(myDf.iloc[0],pd.Series))

print('\nThe output is a DataFrame:')
print(myDf.iloc[[0]]) # as DataFrame
#print(type(myDf.iloc[[0]]))
#print('\nIs a DataFrame?')
#print(isinstance(myDf.iloc[[0]],pd.DataFrame))

print('\nUse loc for accessing the first row label (2):')
print('\nThe output is a Series:')
print(myDf.loc[2]) # as Series or 
print('\nThe output is a DataFrame:')
print(myDf.loc[[2]]) # as DataFrame

<b>2.2.1. Selecting rows or columns with *iloc*</b> (positions)
<br>
<br>
Property <b>.iloc</b> selects data based on <b>positions</b>. To retrieve a <b>single row or column</b>, use an <b>integer</b> inside square brackets.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# This gets the first row of the dataframe
# as a Series
print(df.iloc[0]) # note it uses square brackets, not parentheses, because it is a property, not a method
# or as a DataFrame
#print(df.iloc[[0]])

# This gets the first column of the dataframe
# as a Series
#print(df.iloc[:,0]) # note it uses square brackets, not parentheses, because it is a property, not a method
# or as a DataFrame
#print(df.iloc[:,[0]])

To retrieve <b>multiple rows or columns</b> with <b>iloc</b>, use a <b>list of integer positions</b> inside square brackets.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# This gets some selected rows of the dataframe with a list of indexes
print(df.iloc[[0,3,18]]) 

# This gets some selected columns of the dataframe with a list of indexes
#print(df.iloc[:,[0,12,17]])

<b>A note on Slicing</b>

<b>Slicing</b> is the extraction of a part of an indexable container (e.g., string, list, or tuple). It enables users to access the specific range of elements by mentioning their indices. 

<b>Index</b> is the position of an individual element in an indexable container. The index value always starts at zero and ends at one less than the number of items.

Negative indexes enable users to index an indexable container from the end of the container, rather than the start.

<b>Usage:</b> Container[start:stop:step]

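<b>Start</b> specifies the index at which the slice begins (included; defaults to the start of the container) <br>
<b>Stop</b> specifies the index at which the slice ends (excluded; defaults to the end of the container) <br>
<b>Step</b> specifies the increment between selected elements (defaults to 1), so it can be used to skip elements <br>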

It is also possible to use the slicing method with Pandas Series and DataFrames (also indexable containers).

In [None]:
import pandas as pd

myCont = 'python' # an indexable container can be a string
#myCont = ['p','y','t','h','o','n'] # an indexable container can be a list
#myCont = ('p','y','t','h','o','n') # an indexable container can be a tuple

print(myCont[0:5]) # note the stop element (n) is not printed
#print(myCont[0:]) # the whole container is printed, starting from the first element.
#print(myCont[:6]) # the whole container is printed. Start is 0 as default.
#print(myCont[0:6:2]) # here we print every second element
#print(myCont[5]) # here we print the last element (n)
#print(myCont[-1]) # here we print the last element (n) using negative indexes
#print(myCont[-4:-1]) # again, the stop element (which is the last one, n) is not printed
#print(myCont[-4:-1:2]) # we can also use step with negative indexes

To retrieve a <b>sequence of rows or columns</b> using <b>slicing</b> with <b>iloc</b>, use start:stop integer positions inside square brackets. As with other Python containers, the element at the stop position is not included.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# This gets a range of rows of the dataframe with the slicing method
#print(df.iloc[0:3]) # note that the row on position 3 is not selected

# This gets a range of columns of the dataframe with the slicing method
print(df.iloc[:,0:12]) # remember (start:stop:step) # column at the stop position is not retrieved

<b>Boolean indexing</b>

Boolean indexing involves creating a boolean vector where each element corresponds to a row or column in your DataFrame. The boolean vector contains True for rows or columns you want to keep and False for those you don’t. You can apply boolean vectors to rows <b>without the need of .loc and .iloc</b> (see examples below).

In [None]:
import pandas as pd
import numpy as np # needed for the commented-out boolean array examples below

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.max_columns',100)
pd.set_option('display.width',1000)

print('\nThis prints only rows of female creatures:')
print(df[df['sex'] == 'female'])
#print(df[df.sex == 'female'])

#If you create the boolean vector based on more than one condition, 
# each condition must be enclosed in parentheses:
print('\nThis prints only rows of female humans:')
#print(df[(df['sex'] == 'female') & (df['species'] == 'Human')])
print(df[(df.sex == 'female') & (df.species == 'Human')])#

#boolidx = np.zeros((df.shape[0]), dtype='bool')
#print(boolidx)
#boolidx[0:5] = True
#print('\nThis prints only the first 5 rows:')
#print(df[boolidx])

#boolidx = np.zeros((df.shape[0]), dtype='bool')
#for idx in range(0,df.shape[0]):
    #if idx %2 == 0:
        #boolidx[idx] = True
#print('\nThis prints only every two rows:')
#print(df[boolidx])
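
It can help to look at the boolean vector itself before using it as a filter. The sketch below uses an invented dataframe (the names `toy` and `mask` are illustrative assumptions) and shows how conditions are combined with `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

# Hypothetical toy data (not the mission file)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'sex': ['female', 'male', 'female', 'male'],
    'species': ['Human', 'Human', 'Droid', 'Wookiee'],
})

mask = toy['sex'] == 'female'   # boolean Series, one value per row
print(mask)

# Each condition is wrapped in parentheses before combining
female_humans = toy[mask & (toy['species'] == 'Human')]     # AND
female_or_droid = toy[mask | (toy['species'] == 'Droid')]   # OR
not_human = toy[~(toy['species'] == 'Human')]               # NOT
print(female_humans, female_or_droid, not_human, sep='\n\n')
```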

It is possible to retrieve rows or columns with <b>iloc</b> using a <b>boolean list</b>, created for example based on conditions.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('mission1_data.csv')

boolidx = np.zeros((df.shape[0]), dtype=bool)
boolidx[0:5] = True
#print(df.iloc[boolidx]) # this works similar to previous example

# get rows based on a condition (e.g., all female creatures)
#print(df.iloc[(df.sex =='female').to_list()])

#print(df.iloc[((df.sex =='female') & (df.species=='Human')).to_list()])

# get columns with a specific data type
#print(df.dtypes)
#print(df.iloc[:,(df.dtypes==float).to_list()]) # float or object

<b>2.2.2. Accessing rows and columns with *.loc*</b> (labels)
<br>
<br>
Property <b>.loc</b> selects data based on index <b>labels</b> or <b>boolean arrays</b>. To retrieve a <b>single row or column</b>, use an <b>integer</b> or <b>string label</b> inside square brackets.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# Currently, row labels are equivalent to positions
print(df.loc[[0]])

# But, if we assign string labels to the rows, we can use labels to get a specific row:
#df2 = df.set_index('name')
#print(df2.loc[['Luke Skywalker']])

# Similarly, we can assign integer labels to the rows, different from the row positions
#df2 = df.set_axis([i for i in range(1,df.shape[0]+ 1) ], axis=0)
#print(df2.loc[[1]])

# Also, you can use the loc method for extracting single columns
#print(df.loc[:,['name']])

# alternative ways to extract single columns without loc
#print(df.name)
#print(df['name'])

To retrieve <b>multiple rows or columns</b> with the loc property, use a <b>list of integer or string labels</b> inside square brackets.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.width', 1000) # changes the width of the window where the data is displayed

# Use a list of integer labels for selecting multiple rows
print(df.loc[[0,3,18]])

# For extracting multiple columns, use a list of string labels
#print(df.loc[:,['name', 'occupation', 'abilities']])

# Like in the previous section, if we assign string labels to the rows, we can use a list of labels to get multiple rows:
#df2 = df.set_index('name') # use this column as index labels and remove the column
#df2 = df.set_axis((df.name).to_list()) # now the row labels are names, not numbers, but keeps the column
#print(df2.loc[['Luke Skywalker','Darth Vader','Yoda']])

# You can access a part of the data using a list of integer labels for the rows, 
# and a list of string labels for the columns:
#print(df.loc[[0,3,18],['name', 'occupation', 'abilities']]) # note that here row labels are the same as positions

# You can access a part of the data using a list of string labels for the rows (e.g., person's or creature's names),
# and a list of string labels for the columns:
# print(df2.loc[['Luke Skywalker','Darth Vader','Yoda'],['name', 'occupation', 'abilities']])

To retrieve a <b>range of rows or columns</b> with the <b>loc</b> property, use <b>start:stop integer or string labels</b> inside square brackets. Unlike with iloc, <b>the stop label is included</b> in the output.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')
# for extracting multiple columns, use .loc
pd.set_option('display.width', 1000) # changes the width of the window where the data is displayed
df2 = df.set_index('name')
# you can even use the index label to get a range of rows
print(df2.loc['Luke Skywalker':'Darth Vader']) 

# or a range of columns
# print(df2.loc[:,'mass':'skin_color']) 
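
The different treatment of the stop value in `.iloc` and `.loc` is easy to overlook. As a small sketch on invented data (the dataframe `toy` is an assumption made for illustration), compare the two slices:

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

by_position = toy.iloc[0:2]    # positions 0 and 1; position 2 is excluded
by_label = toy.loc['a':'c']    # labels 'a', 'b' and 'c'; the stop label is included

print(by_position)
print(by_label)
```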

It is possible to retrieve rows or columns with <b>loc</b> using a <b>boolean array</b>, created for example based on conditions. Let's say you only want to retrieve the rows of the humans. In this case, you can use a boolean array.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('mission1_data.csv')

boolidx = np.zeros((df.shape[0]), dtype=bool)
boolidx[0:5] = True
print(df.loc[boolidx]) # this works like with .iloc 

# Note that it also works when the row labels differ from the positions
#df2 = df.set_index('name')
#print(df2.loc[boolidx])

# In addition, .loc accepts boolean Series (something .iloc does not):
#print(df.loc[df.species == 'Human'])
#print(df.species == 'Human')
# If you attempt to run the following line, you will get an error, because the condition returns
# a boolean Series.
#print(df.iloc[df.species == 'Human'])
#print(type(df.species == 'Human'))
# You need to convert the boolean Series to a boolean list for it to work with .iloc.
#print(df.iloc[(df.species == 'Human').to_list()])
# BTW, .loc also works with boolean lists
#print(df.loc[(df.species == 'Human').to_list()])

# You could also use multiple conditions (e.g., those humans born between 40 BBY and 0 BBY)
# print(df.loc[(df.species == 'Human') & (df.birth_era == 'BBY') & (df.birth_year < 40)])

# or, you could search for specific terms in a column
#print(df.loc[df.occupation.str.contains('Jedi', regex=False)])

# It is also possible to select some columns with a boolean array (e.g., a filter based on the data type)
#print(df.loc[:,df.dtypes==float]) # remember that with iloc you needed to convert the boolean array to a list.

#### <b>2.3. Filtering data in DataFrames</b>

Filtering is used to select specific rows from the dataframe based on some conditions that are generally applied to specific columns of the dataframe. In the previous section, we saw that we could filter rows by using .iloc and .loc properties with boolean lists. Let's explore other possibilities for filtering data in dataframes.

<b>2.3.1. Filter rows by using comparison operators</b>

There are six comparison operators in Python: 
1) == or equal to 
2) != or not equal to
3) \> or greater than 
4) \>= or greater than or equal to
5) < or less than
6) <= or less than or equal to

They can be used to compare different values such as integers or strings. These operators can be included in an expression to filter elements in a DataFrame.

For example, we could search for specific terms in a column of strings:

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.width',1000)

# look for female creatures
print(df[df['sex'] == 'female'])
#print(df[(df['sex'] == 'female') & (df['species'] == 'Human')])

We could also filter a column with numeric values:

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.width', 1000)

# look for the tallest creatures
print(df[df['height'] >= 170])

# or
#print(df.loc[df['height'] >= 170])

<b>2.3.2. Filter rows using .isin()</b>

It is possible to filter rows based on specific terms or values in a column with strings or numbers. <b>.isin()</b> allows you to do that.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.max_columns',1000)

# applicable to strings
print(df[df['species'].isin(['Yoda\'s species', 'Wookiee', 'Droid'])])
# equivalent to
#print(df.loc[df['species'].isin(['Yoda\'s species', 'Wookiee', 'Droid'])])

# and numeric values
# print(df[df['death_year'].isin([0,3,4])])
# equivalent to
#print(df.loc[df['death_year'].isin([0,3,4])])
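
A common variation is to keep the rows that are <b>not</b> in the list, by negating the result of .isin() with `~`. A minimal sketch on invented data (`toy` and `wanted` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data (not the mission file)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'species': ['Human', 'Wookiee', 'Droid', 'Human'],
})

wanted = ['Wookiee', 'Droid']

selected = toy[toy['species'].isin(wanted)]    # rows whose species is in the list
excluded = toy[~toy['species'].isin(wanted)]   # rows whose species is NOT in the list

print(selected)
print(excluded)
```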

<b>2.3.3. Filter rows using str.contains()</b> 

It is possible to filter rows based on specific terms in a column with text strings. <b>str.contains()</b> allows you to do that.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.width',1000)
pd.set_option('display.max_columns',30)

print(df[df['abilities'].str.contains('Force|Lightsaber', case=False, na=False)]) #& ~df['abilities'].isna()
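
Two arguments of str.contains() deserve attention. With `na=False`, rows with a missing value are treated as "no match" instead of propagating a missing value into the filter. With `regex=False`, the pattern is read as plain text, so characters such as `(` or `|` are matched literally. The sketch below illustrates both on invented data (`toy` is a hypothetical dataframe).

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing entry
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'abilities': ['Force push, Lightsaber combat', np.nan, 'Piloting (expert)'],
})

# na=False: the row with the missing value simply does not match
has_force = toy['abilities'].str.contains('force', case=False, na=False)
print(toy[has_force])

# regex=False: the parentheses are matched literally, not read as a regex group
is_expert = toy['abilities'].str.contains('(expert)', regex=False, na=False)
print(toy[is_expert])
```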

<b>2.3.4. Filter by applying the .query() method</b>

The <b>query()</b> method is a handy tool in pandas for filtering DataFrames based on conditional expressions. It is a concise and readable way to select specific rows that meet certain criteria. The input is a string that defines the filtering criteria and should evaluate to True or False for each row. For multiple conditions, combine them with the logical operators <b>and</b> and <b>or</b>. Note: a column name that contains spaces cannot be written directly in the expression; wrap it in backticks or rename the column first.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.width',1000)
pd.set_option('display.max_columns',30)

# applied to columns with strings
print(df.query('sex == "female"')) # remember, we did this before with df[df['sex'] == 'female']

# applied to columns with numbers
#print(df.query('height >=170')) # remember, we did this before with df.loc[df['height'] >=170]

# using variables
year_limit = 0
era = 'BBY'
#print(df.query('birth_year > @year_limit and birth_era == @era'))
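
Column names that contain spaces can still be used inside query() by wrapping them in backticks. The following sketch uses an invented dataframe (`toy`, `size_cm` and `min_size` are assumptions made for illustration) and also references a local variable with `@`.

```python
import pandas as pd

# Hypothetical toy data with a space in one column name
toy = pd.DataFrame({
    'fruit color': ['white', 'yellow', 'orange'],
    'size_cm': [20, 30, 8],
})

min_size = 10  # local variable, referenced inside the query with @

# Backticks let query() refer to a column whose name contains a space
result = toy.query('`fruit color` != "white" and size_cm > @min_size')
print(result)
```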

<b>2.3.5. Filtering by using boolean columns</b>

You can create new Boolean columns (obtained by applying specific conditions) and use those columns to filter the rows of the dataframe.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

#print(df.columns) # print the column names

# Create new boolean column saying whether or not the creature has a lightsaber
colname = 'equipment'
keyword = 'Lightsaber'

df['haslightsaber'] = df[colname].str.contains(keyword, case=False, regex=False, na=False) # na=False treats missing values as False
print(df) 

# Create new boolean column saying whether or not the creature is a Jedi
colname = 'occupation'
keyword = 'Jedi'

df['isjedi'] = df[colname].str.contains(keyword, case=False, regex=False, na=False)
# print(df)
 
# Create new boolean column saying whether or not the creature has Force powers
colname = 'abilities'
keyword = 'Force'

df['hasforce'] = df[colname].str.contains(keyword, case=False, regex=False, na=False)
# print(df) 

# Now, retrieve all the rows with a creature that has a lightsaber, force powers but is not a Jedi
pd.set_option('display.max_columns',30)
print(df[df.haslightsaber & df.hasforce & ~df.isjedi])

<b>2.3.6. Applying functions to dataframes</b>

The apply() method takes a function as an input and applies this function to a DataFrame. 
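You must specify the axis you want your function to act on: with axis=0 (the default) the function receives each column in turn, i.e., it runs down the rows of a column; with axis=1 it receives each row in turn, i.e., it runs across the columns of a row. When apply() is called on a single column (a Series), the function is applied to each element of that column.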

Apply a function across rows in a column:

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

def kg2pounds(x):
    return x*2.20462 # 1 kg is approximately 2.20462 pounds

# convert weight (in kg) to weight in pounds
df['weight_pounds'] = df['mass'].apply(kg2pounds)
print(df[['mass','weight_pounds']])

Apply a function across columns in a row:

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('mission1_data.csv')

def compute_age(row):
    # row is a Series holding the values of one row of the dataframe
    if (row['birth_era'] == 'BBY') and (row['death_era'] == 'BBY'):
        return row['birth_year'] - row['death_year'] # BBY years count down towards year 0
    elif (row['birth_era'] == 'BBY') and (row['death_era'] == 'ABY'):
        return row['birth_year'] + row['death_year']
    elif (row['birth_era'] == 'ABY') and (row['death_era'] == 'ABY'):
        return row['death_year'] - row['birth_year']
    else:
        return np.nan # a real missing value (not the string 'NaN') keeps the column numeric

# compute the age at death from the birth and death years
df['age'] = df.apply(compute_age, axis=1)
print(df[['birth_year','birth_era','death_year','death_era','age']])
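
To make the role of the axis argument concrete, here is a minimal sketch on invented data (`toy` is a hypothetical dataframe): with axis=0 the function is called once per column, with axis=1 once per row.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the lambda receives each column as a Series; one result per column
per_column = toy.apply(lambda col: col.max(), axis=0)

# axis=1: the lambda receives each row as a Series; one result per row
per_row = toy.apply(lambda row: row['a'] + row['b'], axis=1)

print(per_column)
print(per_row)
```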

Using a lambda function:

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('mission1_data.csv')

def calc_bmi(mass, height):
    return np.round(mass/(height/100)**2, 2)

df['bmi'] = df.apply(lambda x: calc_bmi(x['mass'],x['height']),axis=1)

pd.set_option('display.width',1000)
pd.set_option('display.max_rows',200)
print(df[['name','mass', 'height', 'bmi']])
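
For simple arithmetic such as the BMI, the same result can usually be obtained without apply() by operating on whole columns at once, which tends to be shorter and faster. This is a sketch on invented values (the dataframe `toy` is an assumption), not a replacement for the cell above.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data; one mass is missing on purpose
toy = pd.DataFrame({
    'mass': [77.0, 136.0, np.nan],
    'height': [172.0, 202.0, 150.0],
})

# Row by row, as in the lambda example
toy['bmi_apply'] = toy.apply(
    lambda r: np.round(r['mass'] / (r['height'] / 100) ** 2, 2), axis=1
)

# Column arithmetic: pandas applies the formula to every row at once
toy['bmi_vector'] = (toy['mass'] / (toy['height'] / 100) ** 2).round(2)

print(toy)
```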

Create a boolean column by applying a lambda function to a column

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# Create new boolean column saying whether or not the creature has a lightsaber
colname = 'equipment'
keyword = 'Lightsaber'

# we used this before: df['haslightsaber'] = df[colname].str.contains(keyword, case=False, regex=False)
pd.set_option('display.width',2000)

def lightsaberfun(x):
    # the "in" test already evaluates to True or False
    return keyword.lower() in str(x[colname]).lower()

df['haslightsaber'] = df.apply(lightsaberfun, axis=1)

# If the function is simple enough as to be declared in a single line, 
# you can obtain the same result with an anonymous function called lambda.
#df['haslightsaber'] = df.apply(lambda x: True if keyword.lower() in str(x[colname]).lower() else False, axis=1)
print(df[[colname, 'haslightsaber']]) 

# finally, you can get only the rows filtered with the boolean column.
print(df.loc[df.haslightsaber])

<b>2.3.7. Filtering based on dates</b>

If a DataFrame contains dates and time information, it is also possible to filter data based on those values. For this, the column with dates should be a datetime type.

In [None]:
import pandas as pd
import numpy as np

# create a range of daily dates covering the last three months of 2024
# (fixed dates keep the November and December filters below meaningful whenever the cell is run)
dates_range = pd.date_range(start="2024-10-01", end="2024-12-31", freq="D")
print(dates_range)

fake_data = np.random.randint(0,75,len(dates_range)).tolist()
fake_data2 = np.random.randint(-10,24,len(dates_range)).tolist()

# create the dataframe
df = pd.DataFrame({'date': dates_range,
                   'rain_mm': fake_data, 
                   'temperature': fake_data2})

df.info() # check that the column date is a datetime type (info() prints its report itself)

# Get new dataframe only with data from November by using .loc and a boolean expression
#print(df.loc[(df['date'] >= '2024-11-01') & (df['date'] <= '2024-11-30')])
#print(df.loc[(df.date >= '2024-11-01') & (df.date <= '2024-11-30')])

# Second, it is also possible to filter dataframe rows using the .query() function.
#print(df.query("date >= '2024-11-01' and  date <= '2024-11-30'"))
# Note that: 
# 1) The columns of the DataFrame can be included in the query by simply
# specifying the column name, so they can be accessed without indexing.
# 2) The whole expression is included between two double quotation marks ("")
# 3) The date string is included between single quotation marks ('')

# Third, it is possible to filter based on date attributes such as year, month, day, weekday
# get new dataframe only with data from December
#print(df.loc[(df['date'].dt.month==12)])
#print(df.loc[(df.date.dt.month==12)])

# Finally, it is also possible to use the strftime method
#print(df[df['date'].dt.strftime('%Y-%m-%d').between('2024-11-15','2024-11-30')])
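
When dates are read from a file they often arrive as text, so they need to be converted before the filters above will work. A minimal sketch, assuming an invented dataframe `toy` whose dates are stored as strings in the format year-month-day:

```python
import pandas as pd

# Hypothetical toy data: the dates are plain strings at first
toy = pd.DataFrame({
    'date': ['2024-10-30', '2024-11-02', '2024-11-29', '2024-12-01'],
    'rain_mm': [5, 0, 12, 3],
})

# Convert the text column to a datetime type
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')

# Now date attributes and date comparisons are available
november = toy[toy['date'].dt.month == 11]
in_range = toy[toy['date'].between('2024-11-01', '2024-11-30')]

print(november)
print(in_range)
```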

### <b>Re-encoding</b>

#### <b>2.4. Renaming the column/row names (labels) of a DataFrame</b>

Sometimes, the original column or row names are long, complex, or do not represent well the data. It is possible to change the labels for more intuitive and handy names. In the previous section, we did so with the dataframe functions <b>.set_axis()</b> and <b>.set_index()</b>. There are additional options.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

pd.set_option('display.width', 1000) # changes the width of the window where the data is displayed

# For example, we could use a column to set the index (row labels)
# The set_index() method is used to assign a column, a Series, or an array 
# as the index of a given data frame.
df2 = df.set_index('name') # use this column as index labels and remove the column
#df2 = df.set_index(pd.Series(df.name.to_list())) # use the values as index labels, but keep the column
print(df2.head())

# We could use another method (.set_axis()) to change the row and column labels
df2 = df.set_axis(df.name.to_list()) # now the row labels are names, not numbers, but keeps the column
print(df2.head())

# We could also use .set_axis() with axis=1 to change the column names
colnames = df.columns.to_list()
colnames[1] = 'height_cm' # assumes the height column is at position 1
colnames[2] = 'weight_kg' # assumes the mass column is at position 2

df2 = df.set_axis(colnames, axis=1)
print(df2.head())

Renaming columns is also necessary when the column names have white spaces. 

In [None]:
import pandas as pd

myDict = {
    'tropical fruits': ['banana', 'pineapple', 'mango', 'mamey'],
    'fruit color': ['white', 'yellow', 'orange', 'pink']
}

df = pd.DataFrame(myDict)

print(df)
print('\n')

# the following expression has a white space and will raise an error
# if you attempt to use the dot notation
#print(df.fruit color)

# however, you can call the column using the brackets notation
#print(df['fruit color'])

# so, you may want to rename the columns with labels that don't contain white spaces
#df2 = df.rename(columns={'tropical fruits': 'tropical_fruits', 'fruit color': 'fruit_color'})
# or, using .set_axis()
df2 = df.set_axis(['tropical_fruits', 'fruit_color'], axis=1)
#print('\nAfter renaming the columns')
print(df2)

#print(df2.fruit_color)
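
When many columns need the same clean-up, it may be easier to transform all column names at once with the string methods of the column index. The sketch below is one possible approach on invented data (`toy` and `cleaned` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy data with spaces and capital letters in the column names
toy = pd.DataFrame({
    'Tropical Fruits': ['banana', 'mango'],
    'Fruit Color ': ['white', 'orange'],
})

cleaned = toy.copy()
# strip surrounding spaces, lower-case, then replace inner spaces with underscores
cleaned.columns = cleaned.columns.str.strip().str.lower().str.replace(' ', '_')

print(cleaned.columns.to_list())
print(cleaned.fruit_color)  # the dot notation now works
```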

Sometimes, you only need to insert a term before the row/column name or after the row/column name. You could do that with <b>.add_prefix()</b> and <b>.add_suffix()</b> methods.

In [None]:
import pandas as pd

myDict = {    
    'fruits': ['banana', 'pineapple', 'mango', 'mamey'],
    'color': ['white', 'yellow', 'orange', 'pink'],
    'size': [20, 30, 8, 11],
}

df = pd.DataFrame(myDict)
df = df.set_index('fruits')
print(df)
print('\n')

# add prefix to the column names
df = df.add_prefix('fruit_')

# add prefix to the row names (the axis argument requires pandas 2.0 or later)
df = df.add_prefix('product_', axis=0)
print(df)

In [None]:
import pandas as pd

myDict = {    
    'fruits': ['banana', 'pineapple', 'mango', 'mamey'],
    'color': ['white', 'yellow', 'orange', 'pink'],
    'size': [20, 30, 8, 11],
}

df = pd.DataFrame(myDict)
df = df.set_index('fruits')
print(df)
print('\n')

# add suffix to the column names
df = df.add_suffix('_of_the_fruit')

# add suffix to the row names (the axis argument requires pandas 2.0 or later)
df = df.add_suffix('_row', axis=0)
print(df)

#### <b>2.5. Change the data type of columns in a DataFrame</b>

At this point, it would be good to convert some columns with object type to category type, and some columns with float64 type to integer type. This could save memory and also facilitate working with the data in following steps.

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

#print(df.info()) # print information about columns to remember their data types
print(df.columns)
# convert some columns to category type
df['sex'] = df['sex'].astype('category')
df['gender'] = df['gender'].astype('category')
# convert some columns to integer type
df['birth_year'] = df['birth_year'].astype('Int64')
df['death_year'] = df['death_year'].astype('Int64')
# convert some columns to string type
df['name'] = df['name'].astype('string')
df['abilities'] = df['abilities'].astype('string')
df.info()
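
To see the memory effect of the category type, you can compare the memory usage of the same column stored as object and as category. This is a rough sketch on invented data (`toy` is a hypothetical dataframe); the exact numbers depend on the pandas version and the data.

```python
import pandas as pd

# Hypothetical toy data: few distinct values, repeated many times
toy = pd.DataFrame({'sex': ['female', 'male', 'none'] * 1000})

# deep=True also counts the memory used by the strings themselves
as_object = toy['sex'].astype('object').memory_usage(deep=True)
as_category = toy['sex'].astype('category').memory_usage(deep=True)

print('object  :', as_object, 'bytes')
print('category:', as_category, 'bytes')
```

The saving comes from storing each distinct value only once; a column with mostly unique values (such as names) would not benefit in the same way.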

#### <b>2.6. Adding new columns to the DataFrame</b>

The <b>.assign()</b> method lets you add new columns to a dataframe. It returns a new dataframe and leaves the original unchanged.

In [None]:
import pandas as pd

myDict = {    
    'fruits': ['banana', 'pineapple', 'mango', 'mamey'],
    'color': ['white', 'yellow', 'orange', 'pink'],
    'size': [20, 30, 8, 11],
}

df = pd.DataFrame(myDict)
print(df)

newdf = df.assign(weight = [0.18, 1.5, 0.5, 0.75])
print(newdf)

The <b>pd.concat()</b> function lets you combine several dataframes, for example to add multiple columns to a dataframe at once (use axis=1).

In [None]:
import pandas as pd

myDict1 = {    
    'fruits': ['banana', 'pineapple', 'mango', 'mamey'],
    'color': ['white', 'yellow', 'orange', 'pink']
}

df1 = pd.DataFrame(myDict1)
print(df1)

myDict2 = {    
    'size': [20, 30, 8, 11],
    'weight': [0.18, 1.5, 0.5, 0.75]
}

df2 = pd.DataFrame(myDict2)
print(df2)

newdf = pd.concat((df1,df2), axis=1) # important: use axis=1 to concatenate along columns
print(newdf)
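
Keep in mind that concatenating along columns matches rows by their <b>labels</b>, not by their positions. The sketch below, on invented data (`left` and `right` are hypothetical dataframes), shows what happens when the row labels differ and one way to line the rows up by position.

```python
import pandas as pd

# Hypothetical toy data: the two dataframes have different row labels
left = pd.DataFrame({'fruits': ['banana', 'mango']}, index=[0, 1])
right = pd.DataFrame({'size': [20, 8]}, index=[1, 2])

# Rows are matched by label: labels 0, 1 and 2 all appear, with gaps filled by NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Resetting the index first makes both dataframes use labels 0 and 1
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(aligned)
```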

#### <b>2.7. Note on help functions</b>

To learn about different Python components, and as a support when coding, you can use the built-in functions help() and dir().

<b>help(object)</b> displays documentation about various Python objects including modules, functions, classes, and keywords.


In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

# If the argument is a string, then the string is looked up as the name of a module, 
# function, class, method, keyword, or documentation topic, and a help page is printed on the console. 
help('dict') # get help on a dictionary

# If the argument is any other kind of object, a help page on the object is generated.
#help(df.shape) # get help on the datatype of the object (in this case, a tuple)
#help(df.apply) # get help on a function
#help(df['name'].str.contains)

<b>dir(object)</b> returns a list of valid attributes for that object.

In [None]:
import pandas as pd

help(dir)

# Without arguments, return the list of names in the current local scope. 
#dir()
dir(pd.DataFrame) # With an argument, attempt to return a list of valid attributes for that object.
#dir('a string')
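
The list returned by dir() is long, so it is often filtered. As a small sketch (the variable names are illustrative assumptions), the following keeps only the public names of a DataFrame and then searches them for a keyword:

```python
import pandas as pd

# Keep only the names that do not start with an underscore
public_names = [n for n in dir(pd.DataFrame) if not n.startswith('_')]
print(len(public_names), 'public attributes and methods')

# Search the names for a keyword, e.g. everything related to prefixes and suffixes
fix_related = [n for n in public_names if 'fix' in n]
print(fix_related)
```

Once a promising name is found, help() can be called on it, for example help(pd.DataFrame.add_prefix).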

#### <b>2.8. Working with dates in DataFrames</b>

### <b>Mission 2</b>
Beginning to work on Mission 2. Reading Excel files. 

In [None]:
import pandas as pd
from datetime import date

df = pd.read_csv('mission1_data.csv')

d0 = date(1815,6,30)
print(d0)

#ts = pd.Timestamp('1815-06-30 12:00:00')
ts = pd.Timestamp(d0)

print(ts)

# Creating a date offset object
date_offset = pd.DateOffset(days=368)

print(ts - date_offset)

def gst2gmt(x,y):
    # a missing year or era cannot be converted, so return "not a time"
    if pd.isna(x) or pd.isna(y):
        return pd.NaT
    date_offset = pd.DateOffset(days=x*368)
    if y == 'BBY':
        return ts - date_offset
    elif y == 'ABY':
        return ts + date_offset
    return pd.NaT # unknown era

def birth_gst2gmt(x):
    return gst2gmt(x['birth_year'], x['birth_era'])

def death_gst2gmt(x):
    return gst2gmt(x['death_year'], x['death_era'])

df['birth_date'] = df.apply(birth_gst2gmt, axis=1)

df['death_date'] = df.apply(death_gst2gmt, axis=1)

#print(df)
pd.set_option('display.width',2000)
print(df[['birth_year', 'birth_era', 'birth_date', 'death_year', 'death_era', 'death_date']])

#df.to_csv('mission1_data_gmt.csv')

In [None]:
import pandas as pd
from datetime import datetime

df = pd.read_csv('mission1_data_gmt.csv')

# df['age_gmt'] = 

In [None]:
import pandas as pd
from datetime import datetime

df = pd.read_csv('mission1_data_gmt.csv')

#print(df['birth_date'])

#df['birth_date'] = df['birth_date'].astype('string')
#df['death_date'] = df['death_date'].astype('string')

#df['birth_date'] = pd.to_datetime(df['birth_date'], as_unit='S')#, format='yyyy-mm-dd HH:MM:SS')
#df['death_date'] = pd.to_datetime(df['death_date'], format='yyyy-mm-dd HH:MM:SS')

#df['birth_date'] = df['birth_date'].apply(lambda x: datetime.strftime(x, '%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
#df['birth_date'] = df['birth_date'].apply(lambda x: pd.Period(x, freq='d'))
#df['death_date'] = df['death_date'].apply(lambda x: pd.Period(x, freq='s'))
print(df['birth_date'])
df.info() # info() prints its report itself

<b>Exercises with mission data</b>

1. What are the most common birth places of the creatures in Mission 1 data file?

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

colname = 'birth_place'
print(df[colname].value_counts()) # print a Series containing counts of unique values.

2. What are the most common death places of the creatures in Mission 1 data file?

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

colname = 'death_place'
print(df[colname].value_counts()) # print a Series containing counts of unique values.

3. What are the most common homeworlds of the creatures in Mission 1 data file?

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

colname = 'homeworld'
print(df[colname].value_counts()) # print a Series containing counts of unique values.

4. What are the most common homeworlds of the humans?

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

df2 = df[df.species == 'Human']
#print(df2.describe())
colname = 'homeworld'
print(df2[colname].value_counts()) # print a Series containing counts of unique values.

5. What are the most common birth places of the Jedis in Mission 1 data file?

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

df2 = df[df.occupation.str.contains('Jedi', case=False, regex=False, na=False)] # na=False: rows without occupation are left out
#print(df2)
colname = 'birth_place'
print(df2[colname].value_counts()) # print a Series containing counts of unique values.

6. What are the most common death places of the Jedis in Mission 1 data file?

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

df2 = df[df.occupation.str.contains('Jedi', case=False, regex=False, na=False)] # na=False: rows without occupation are left out
#print(df2)
colname = 'death_place'
print(df2[colname].value_counts()) # print a Series containing counts of unique values.

7. In which year did most creatures die?

In [None]:
import pandas as pd
import math

df = pd.read_csv('mission1_data.csv')

def combineyearera(x):
    if math.isnan(x['death_year']):
        return None # a real missing value, so value_counts() leaves it out (the string 'NaN' would be counted)
    else:
        return str(int(x['death_year'])) + ' ' + x['death_era']

df['death_year_era'] = df.apply(combineyearera, axis=1)
#print(df)
colname = 'death_year_era'
print(df[colname].value_counts()) # print a Series containing counts of unique values.

8. What is the year when most Jedis died?

In [None]:
import pandas as pd
import math

df = pd.read_csv('mission1_data.csv')

def combineyearera(x):
    if math.isnan(x['death_year']):
        return None # a real missing value, so value_counts() leaves it out (the string 'NaN' would be counted)
    else:
        return str(int(x['death_year'])) + ' ' + x['death_era']
    
df['death_year_era'] = df.apply(combineyearera, axis=1)
    
df2 = df[df.occupation.str.contains('Jedi', case=False, regex=False, na=False)] # na=False: rows without occupation are left out

#print(df2)
colname = 'death_year_era'
print(df2[colname].value_counts()) # print a Series containing counts of unique values.

9. What is the place where most Jedis died in 19 BBY?

In [None]:
import pandas as pd
import math

df = pd.read_csv('mission1_data.csv')

def combineyearera(x):
    if math.isnan(x['death_year']):
        return None # a real missing value, so the row is simply not matched by the filter below
    else:
        return str(int(x['death_year'])) + ' ' + x['death_era']
    
df['death_year_era'] = df.apply(combineyearera, axis=1)
    
df2 = df[(df.occupation.str.contains('Jedi', case=False, regex=False, na=False)) 
         & (df.death_year_era.isin(['19 BBY']))]

#print(df2)
colname = 'death_place'
print(df2[colname].value_counts()) # print a Series containing counts of unique values.
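
In all the exercises above, the answer is the first entry of the value_counts() output. If you only need that entry, or the shares instead of the counts, the following sketch shows a few options on invented data (`places` is a hypothetical Series, not a column of the mission file).

```python
import pandas as pd

# Hypothetical toy data with one missing value
places = pd.Series(['Tatooine', 'Naboo', 'Tatooine', None, 'Coruscant', 'Tatooine'])

counts = places.value_counts()                # missing values are left out by default
most_common = counts.idxmax()                 # label with the highest count
top_two = counts.head(2)                      # the two most frequent values
shares = places.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print('Most common:', most_common)
print(shares)
```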