# Tutorial 3: Working with data types in Python

There are 11 tasks in this notebook (be warned: some tasks have more than one part), organised into 3 sections that you can choose between: 

1. String data 
2. Categorical data (and Boolean and Numeric and Missing) which includes some advanced bonus tasks
3. Date and time data (and string)

The aim of this tutorial notebook is to give you some (guided) hands-on experience working with different data types in Python, which you can then compare with the approaches to working with these data types in R. 

In [1]:
# it is always good practice to load the necessary packages and modules at the start of your document
import pandas as pd 
from pandas.api.types import CategoricalDtype 
import numpy as np
import datetime as dt 
import re
import itertools 
from dateutil import parser, tz, relativedelta

## there is a future warning that looks scary, but does not matter to us at the moment, so this code suppresses it
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## 1. String data 

### Task 1 

How would you access "solstice" in `string0` below in code?

In [2]:
string0 = "The summer solstice is on Thursday 20 June 2024"

<details><summary style='color:darkblue'>HINT 1: How to start breaking it down? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

We learned about a function in the Python data types notebook that helps us find the index at which a substring starts within a string (see section 4). From there, we can use that information to access or *slice* the string.

In [None]:
## your answer here



#### Task 1 solution 

In [3]:
# first find the index for the word in the string 
print(string0.index("solstice"))

11


In [4]:
# then taking that info slice the string
print(string0[11:19])

solstice
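
The slice above hard-codes the positions 11 and 19. As a minimal sketch, the end position can instead be worked out from the length of the word, so the same code would work for any word in the string. The names `target`, `start`, and `end` are illustrative and not part of the tutorial.

```python
string0 = "The summer solstice is on Thursday 20 June 2024"
target = "solstice"  # illustrative name: the word we want to pull out

start = string0.index(target)  # position where the word begins
end = start + len(target)      # a slice stops just before the end position

print(string0[start:end])
```

Try changing `target` to another word in `string0` to check that the slice still picks out the right characters.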


### Tasks 2 - 6

There is a corpus of common words in the R `stringr` package that we will use as our data for this task. 

The process of importing this data and making it workable for this task is a bit complicated. I have outlined the logical steps below in code and here in plain language.

First we need to read in the data. To do so, we use the `pd.read_csv('file.csv')` function from the `pandas` package. `read_csv` automatically reads data from a csv file into a `pandas` data frame. Usually this is what we want (as we will see next week), but in this case the data is really just a list of words, so we will then convert the data frame to a list. But oh no, it is a list within a list! We then need to flatten the list structure and join the words, separated by spaces, into a single string that we can work with. You could leave the data as a list within a list, or indeed as a flat list, but for the purposes of this week we are learning how to interact with strings. 
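
As a rough sketch of those steps on a tiny made-up example, the code below reads three words from an in-memory stand-in for the csv file (`toy_csv` and the other `toy_` names are hypothetical) and walks through data frame, nested list, flat list, and string in turn.

```python
import io
import itertools
import pandas as pd

# a tiny stand-in for common_words.csv (hypothetical data, one word per row)
toy_csv = io.StringIO("a\nable\nabout\n")

toy_data = pd.read_csv(toy_csv, header=None)               # data frame with one column
toy_list = toy_data.values.tolist()                         # list of one-item lists
toy_flat = list(itertools.chain.from_iterable(toy_list))   # flat list of words
toy_words = " ".join(toy_flat)                              # one space-separated string

print(toy_list)
print(toy_flat)
print(toy_words)
```

`itertools.chain.from_iterable(toy_list)` does the same job as `itertools.chain(*toy_list)`; either form flattens one level of nesting.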


In [5]:
# read in data
## my data is in a folder called data. If you do not have the same set up, update the file path accordingly 
word_data = pd.read_csv('../data/common_words.csv', header = None) # the first row is not a header, so I have specified header = None 

type(word_data) # indeed words is currently a data frame 

pandas.core.frame.DataFrame

In [6]:
# view the data that has been read in as a pandas data frame 
print(word_data)

             0
0            a
1         able
2        about
3     absolute
4       accept
..         ...
975        yes
976  yesterday
977        yet
978        you
979      young

[980 rows x 1 columns]


In [7]:
# now convert words to a list for this task 
word_list = word_data.values.tolist()

type(word_list) 

list

When reading in data, it is always good practice to print it to make sure it parsed as expected. For this we can use `print()`

In [8]:
# to see the list within a list structure if you are interested
print(word_list)

# notice we now have lists within a list [[...]]

[['a'], ['able'], ['about'], ['absolute'], ['accept'], ['account'], ['achieve'], ['across'], ['act'], ['active'], ['actual'], ['add'], ['address'], ['admit'], ['advertise'], ['affect'], ['afford'], ['after'], ['afternoon'], ['again'], ['against'], ['age'], ['agent'], ['ago'], ['agree'], ['air'], ['all'], ['allow'], ['almost'], ['along'], ['already'], ['alright'], ['also'], ['although'], ['always'], ['america'], ['amount'], ['and'], ['another'], ['answer'], ['any'], ['apart'], ['apparent'], ['appear'], ['apply'], ['appoint'], ['approach'], ['appropriate'], ['area'], ['argue'], ['arm'], ['around'], ['arrange'], ['art'], ['as'], ['ask'], ['associate'], ['assume'], ['at'], ['attend'], ['authority'], ['available'], ['aware'], ['away'], ['awful'], ['baby'], ['back'], ['bad'], ['bag'], ['balance'], ['ball'], ['bank'], ['bar'], ['base'], ['basis'], ['be'], ['bear'], ['beat'], ['beauty'], ['because'], ['become'], ['bed'], ['before'], ['begin'], ['behind'], ['believe'], ['benefit'], ['best'], ['

In [9]:
# flatten the list structure, this uses the chain function from the itertools module 
word_list_flat = list(itertools.chain(*word_list)) 

print(word_list_flat)
# great, we are getting there! 

['a', 'able', 'about', 'absolute', 'accept', 'account', 'achieve', 'across', 'act', 'active', 'actual', 'add', 'address', 'admit', 'advertise', 'affect', 'afford', 'after', 'afternoon', 'again', 'against', 'age', 'agent', 'ago', 'agree', 'air', 'all', 'allow', 'almost', 'along', 'already', 'alright', 'also', 'although', 'always', 'america', 'amount', 'and', 'another', 'answer', 'any', 'apart', 'apparent', 'appear', 'apply', 'appoint', 'approach', 'appropriate', 'area', 'argue', 'arm', 'around', 'arrange', 'art', 'as', 'ask', 'associate', 'assume', 'at', 'attend', 'authority', 'available', 'aware', 'away', 'awful', 'baby', 'back', 'bad', 'bag', 'balance', 'ball', 'bank', 'bar', 'base', 'basis', 'be', 'bear', 'beat', 'beauty', 'because', 'become', 'bed', 'before', 'begin', 'behind', 'believe', 'benefit', 'best', 'bet', 'between', 'big', 'bill', 'birth', 'bit', 'black', 'bloke', 'blood', 'blow', 'blue', 'board', 'boat', 'body', 'book', 'both', 'bother', 'bottle', 'bottom', 'box', 'boy', '

In [10]:
# join the lists to be a string separated by a space 
words = " ".join(word_list_flat)

In [11]:
print(words) # happy days - we are ready to go 

a able about absolute accept account achieve across act active actual add address admit advertise affect afford after afternoon again against age agent ago agree air all allow almost along already alright also although always america amount and another answer any apart apparent appear apply appoint approach appropriate area argue arm around arrange art as ask associate assume at attend authority available aware away awful baby back bad bag balance ball bank bar base basis be bear beat beauty because become bed before begin behind believe benefit best bet between big bill birth bit black bloke blood blow blue board boat body book both bother bottle bottom box boy break brief brilliant bring britain brother budget build bus business busy but buy by cake call can car card care carry case cat catch cause cent centre certain chair chairman chance change chap character charge cheap check child choice choose Christ Christmas church city claim class clean clear client clock close closes clothe

Now we are ready for the task! How many words: 

- Task 2: Start with "y"
- Task 3: End with "w"
- Task 4: Are exactly 3 letters long
- Task 5: Have 8 letters or more
- Task 6: Contain only consonants
 
Tasks 4, 5, and 6 are a bit more tricky. 

To really stretch yourself, consider using code to produce the answer to the above questions once you have a solution. 

<details><summary style='color:darkblue'>HINT 1: How to start breaking it down? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
To address these problems you will need to use regular expressions. There is a helpful Python regular expression [cheat sheet here](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)

<details><summary style='color:darkblue'>HINT 2: Useful functions. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

You do not need to manually count the string outputs, remember the `len()` function

<details><summary style='color:darkblue'>HINT 3: How to use code to produce the answer?. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

We learned how to do string interpolation this week in the Python data types notebook (see section 4)

In [None]:
## your answer here



#### Task 2 - 6 Solutions

In [12]:
# Task 2: Start with "y"

print(re.findall(r'\b[y]\w+', words)) 

## the r prefix before the pattern string creates a raw string, so backslashes such as \b reach the regex engine unchanged  
## \b matches a word boundary: the zero-width position between a word character and a non-word character
## [] contains a set of characters to match 
## y is the character we want each word to start with 
## \w matches a word character (a letter, digit, or underscore) 
## + greedily matches the expression to its left 1 or more times

## remove or edit one or more of these regex components to get a better understanding of how this works 

['year', 'yes', 'yesterday', 'yet', 'you', 'young']
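
To see what `\b` contributes, here is a small sketch on a made-up sentence (`sample` is not part of the tutorial data) that runs the pattern with and without the word boundary.

```python
import re

sample = "yes you enjoy yellow daytime yet"  # made-up sentence for illustration

with_boundary = re.findall(r'\by\w+', sample)   # y must be at the start of a word
without_boundary = re.findall(r'y\w+', sample)  # y can appear anywhere in a word

print(with_boundary)
print(without_boundary)
```

Without `\b`, the pattern also picks up `ytime` from the middle of "daytime". Because `\w+` needs at least one more character after the `y`, a one-letter word `y` would not be matched by either pattern.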


There are 6 words that start with 'y'. We could also write this answer using code! For example: 

In [13]:
# first we need to create an object with our list of words that match our criteria 
task2 = re.findall(r'\b[y]\w+', words)

# then we can use an f-string 
f"There are {len(task2)} words that start with 'y' in the data"

"There are 6 words that start with 'y' in the data"

In [14]:
## if you wrap your f-string in a print function, it will print without the quotation marks 
print(f"There are {len(task2)} words that start with 'y' in the data")

There are 6 words that start with 'y' in the data


In [15]:
## or we can use print without the f-string, which is slightly less elegant 
print("There are", len(task2), "words that start with 'y' in the data")

There are 6 words that start with 'y' in the data


In [16]:
# Task 3: End with "w"
print(re.findall(r'\w+[w]\b', words))

## the r prefix before the pattern string creates a raw string, so backslashes such as \b reach the regex engine unchanged  
## \w matches a word character (a letter, digit, or underscore)
## + greedily matches the expression to its left 1 or more times
## [] contains a set of characters to match 
## w is the character we want each word to end with 
## \b matches a word boundary: the zero-width position between a word character and a non-word character

print(len(re.findall(r'\w+[w]\b', words)))


['allow', 'blow', 'draw', 'few', 'follow', 'grow', 'how', 'know', 'law', 'low', 'new', 'now', 'show', 'slow', 'throw', 'tomorrow', 'view', 'window']
18


In [17]:
## and now to answer the question with code 
task3 = re.findall(r'\w+[w]\b', words)

print(f"There are {len(task3)} words that end with 'w' in the data")

There are 18 words that end with 'w' in the data


In [18]:
# Task 4: Are exactly 3 letters long
task4 = re.findall(r'(\b\w{3}\b)', words)

## in regex, {n} matches exactly n copies of the expression to its left


In [19]:
print(f"There are {len(task4)} words with exactly 3 letters in the data")

There are 110 words with exactly 3 letters in the data


In [20]:
# Task 5: Have 8 letters or more
task_5 = re.findall(r'(\b\w{8,}\b)', words)

## {m,n} matches between m and n copies of the expression to its left; leaving out n, as in {8,}, means "8 or more" 

In [21]:
print(f"There are {len(task_5)} words that have 8 letters or more in the data")

There are 100 words that have 8 letters or more in the data
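
The three forms of the `{}` quantifier can be compared side by side. This is a small sketch on a made-up string (`animals` is illustrative only).

```python
import re

animals = "cat horse elephant dog crocodile ox"  # made-up example string

exactly_three = re.findall(r'\b\w{3}\b', animals)    # {3}: exactly 3 characters
eight_or_more = re.findall(r'\b\w{8,}\b', animals)   # {8,}: 8 or more characters
three_to_five = re.findall(r'\b\w{3,5}\b', animals)  # {3,5}: between 3 and 5 characters

print(exactly_three)
print(eight_or_more)
print(three_to_five)
```

The `\b` on each side matters here: without it, `\w{3}` would also match the first three letters of longer words.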


In [22]:
# Task 6: contain only consonants

## one approach is to say NOT vowels 
task6_0 = re.findall(r'\b[^aeiou\W]+\b', words, flags = re.IGNORECASE)

## within a set [], a leading ^ means "not", so the set matches any character except those listed
## + matches the previous token 1 or more times 
## \W is an escape matching any non-word character; putting it in the negated set also excludes spaces and punctuation
### delete \W in the code above and see what happens


In [23]:
print(f"There are {len(task6_0)} words which contain only consonants in the data")

There are 6 words which contain only consonants in the data


In [24]:
## another approach is to specify only consonants more manually 
task6_1 = re.findall(r'\b[b-df-hj-np-tv-z]+\b', words, flags = re.IGNORECASE)

In [25]:
print(f"There are {len(task6_1)} words which contain only consonants in the data")

There are 6 words which contain only consonants in the data
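
Regular expressions are not the only route. As a hedged alternative sketch (not the tutorial's own solution), the same question can be answered by splitting the string into words and checking each one against a set of vowels. Like the regex solutions above, it treats "y" as a consonant. The `sample` string is made up.

```python
sample = "by dry sky apple try rhythm tea"  # made-up example string
vowels = set("aeiou")

# keep a word only if none of its letters is a vowel
no_vowels = [word for word in sample.split() if not vowels & set(word.lower())]

print(no_vowels)
print(f"There are {len(no_vowels)} words with no vowels in the sample")
```

Swapping `sample` for `words` would give a way to cross-check the regex answer.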


## 2. Categorical data (and Boolean and Numeric and Missing) 

For this section, there is yet again a data set available in R that we will be using. This time the data comes from the `forcats` package. `forcats::gss_cat` is a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. As the survey has thousands of questions, the `gss_cat` data contains a small subset. 

Since this data set is provided by an R package, you can get more information about the variables **in R** with `?gss_cat`. 

As above, to read in the data we will be using our new friend, the `pd.read_csv('file.csv')` function from the `pandas` package.


In [26]:
# read in data
## my data is in a folder called data. If you do not have the same set up, update the file path accordingly 
gss_cat = pd.read_csv('../data/gss_cat.csv')

gss_cat

Unnamed: 0,year,marital,age,race,rincome,partyid,relig,tvhours
0,2000,Never married,26.0,White,$8000 to 9999,"Ind,near rep",Protestant,12.0
1,2000,Divorced,48.0,White,$8000 to 9999,Not str republican,Protestant,
2,2000,Widowed,67.0,White,Not applicable,Independent,Protestant,2.0
3,2000,Never married,39.0,White,Not applicable,"Ind,near rep",Orthodox-christian,4.0
4,2000,Divorced,25.0,White,Not applicable,Not str democrat,,1.0
...,...,...,...,...,...,...,...,...
21478,2014,Widowed,89.0,White,Not applicable,Not str republican,Protestant,3.0
21479,2014,Divorced,56.0,White,$25000 or more,Independent,,4.0
21480,2014,Never married,24.0,White,$10000 - 14999,"Ind,near dem",,4.0
21481,2014,Never married,27.0,White,$25000 or more,Not str democrat,Catholic,


In [27]:
## we will learn more about data frames next week but one function to get a summary of what a data frame contains is data.info()
## data.info() is similar to glimpse() in R 
gss_cat.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21483 entries, 0 to 21482
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   year     21483 non-null  int64  
 1   marital  21483 non-null  object 
 2   age      21407 non-null  float64
 3   race     21483 non-null  object 
 4   rincome  21483 non-null  object 
 5   partyid  21483 non-null  object 
 6   relig    17960 non-null  object 
 7   tvhours  11337 non-null  float64
dtypes: float64(2), int64(1), object(5)
memory usage: 1.3+ MB


### Task 7
What data types are the different variables in `gss_cat`? If there is categorical data, should it be nominal or ordinal? What are the categories of the categorical data? Should or could any of the variables be represented differently in terms of data type?

Go through the data set one column at a time answering the above questions for each of the 8 columns. 

<details><summary style='color:darkblue'>HINT 1: Some new functions! CLICK HERE TO SEE THE ANSWER.</summary>

This task asks you to do something we are familiar with from last week - thinking about data types - but to do so you need to use some code we have not learned about yet for working with data frames. To select a single column, use square brackets `[]` with the name of the column of interest as a character string - e.g., `dataframe["column"]`. We will discuss this more next week; for those interested, the returned object is a `pandas Series` (which we used a little in the Python Data Types Notebook). 
    
Then you can use the `data.head()` function to see the first few rows of data as well as the data type.
    
You can also use `data.describe()` to get a summary of the variable. To get higher-level information you can use `data.info()`.

<details><summary style='color:darkblue'>HINT 2: A new data type (gasp there's more!). CLICK HERE TO SEE THE ANSWER.</summary>

You will notice that some variables/columns are something called type `object`... but what does this mean?? In `pandas`, the `object` dtype represents text or mixed numeric and non-numeric values. `pandas` treats the values in an `object` column as strings, so operations on them are string operations rather than mathematical ones. Thus, if you want a variable to be treated as categorical (dtype `category`), you need to explicitly cast it as such. The simplest way to convert a column to a categorical type is to use `astype('category')`.
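
As a minimal sketch of that cast, the example below uses a hypothetical mini data frame (`toy` and its `size` column are made up, not part of `gss_cat`). It shows the nominal case with `astype('category')` and, as one possible approach for the ordinal case, the `CategoricalDtype` class imported at the top of the notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical mini data frame, not the real gss_cat data
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

size_col = toy["size"]                      # a single column is a pandas Series
size_nominal = size_col.astype("category")  # unordered (nominal) categorical

# an ordered (ordinal) categorical needs the levels spelled out in order
size_levels = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
size_ordinal = size_col.astype(size_levels)

print(size_nominal.cat.categories)
print(size_ordinal.cat.categories)
print(size_ordinal.min(), size_ordinal.max())
```

With no order given, the categories are sorted alphabetically; with an explicit order, comparisons such as `min()` and `max()` follow the order you chose.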

In [None]:
## your answer here



#### Task 7 Solution

#### 7.1 year 

In [28]:
print(gss_cat["year"].head())

0    2000
1    2000
2    2000
3    2000
4    2000
Name: year, dtype: int64


In [29]:
print(gss_cat["year"].describe())

count    21483.000000
mean      2006.501978
std          4.451994
min       2000.000000
25%       2002.000000
50%       2006.000000
75%       2010.000000
max       2014.000000
Name: year, dtype: float64


In [30]:
print(gss_cat["year"].info())

<class 'pandas.core.series.Series'>
RangeIndex: 21483 entries, 0 to 21482
Series name: year
Non-Null Count  Dtype
--------------  -----
21483 non-null  int64
dtypes: int64(1)
memory usage: 168.0 KB
None


`Year` is an integer (specifically `int64`) ranging from 2000 to 2014, but may be suited as an ordered factor depending on the use case.

#### 7.2 marital

In [31]:
print(gss_cat["marital"].head())

0    Never married
1         Divorced
2          Widowed
3    Never married
4         Divorced
Name: marital, dtype: object


In [32]:
print(gss_cat["marital"].describe())

count       21483
unique          6
top       Married
freq        10117
Name: marital, dtype: object


`Marital` is currently an `object` data type, meaning it is being treated as a string, with 6 unique responses (eventually categories). It would make most sense as a nominal categorical variable, though using an ordinal category with "Never married" sorted as the first level, then "Married" followed by "Separated", "Divorced", "Widowed", and "No answer", would make sense for some data presentations.

#### 7.3 age

In [33]:
print(gss_cat["age"].head())

0    26.0
1    48.0
2    67.0
3    39.0
4    25.0
Name: age, dtype: float64


In [34]:
print(gss_cat["age"].describe())

count    21407.000000
mean        47.180081
std         17.287500
min         18.000000
25%         33.000000
50%         46.000000
75%         59.000000
max         89.000000
Name: age, dtype: float64


`Age` is a float (specifically `float64`) ranging from 18 to 89. Depending on the analytic use case, `age` could also work as an ordinal categorical variable with user-determined age brackets.

#### 7.4 race

In [35]:
print(gss_cat["race"].head())

0    White
1    White
2    White
3    White
4    White
Name: race, dtype: object


In [36]:
print(gss_cat["race"].describe())

count     21483
unique        3
top       White
freq      16395
Name: race, dtype: object


`Race` is an `object` data type with 3 unique responses. It makes sense to cast `race` as a nominal categorical variable. 

#### 7.5 rincome

In [37]:
print(gss_cat["rincome"].head())

0     $8000 to 9999
1     $8000 to 9999
2    Not applicable
3    Not applicable
4    Not applicable
Name: rincome, dtype: object


In [38]:
print(gss_cat["rincome"].describe())

count              21483
unique                16
top       $25000 or more
freq                7363
Name: rincome, dtype: object


`rincome` is an `object` data type with 16 unique responses. This variable would likely be most useful as categorical. Depending on the amount of detail needed for analysis of this variable, it would make sense to collapse some categories together and make the variable an ordered categorical. It could also be useful to convert to a string in some instances.

#### 7.6 partyid

In [39]:
print(gss_cat["partyid"].head())

0          Ind,near rep
1    Not str republican
2           Independent
3          Ind,near rep
4      Not str democrat
Name: partyid, dtype: object


In [40]:
print(gss_cat["partyid"].describe())

count           21483
unique             10
top       Independent
freq             4119
Name: partyid, dtype: object


`partyid` is an `object` data type with 10 unique responses. It would make sense to cast this variable as categorical. In some use cases it may be useful to make this an ordinal categorical variable, ordering the categories according to how they fall on the political spectrum. 

#### 7.7 relig

In [41]:
print(gss_cat["relig"].head())

0            Protestant
1            Protestant
2            Protestant
3    Orthodox-christian
4                   NaN
Name: relig, dtype: object


In [42]:
print(gss_cat["relig"].describe())

count          17960
unique            14
top       Protestant
freq           10846
Name: relig, dtype: object


`relig` is an `object` data type with 14 unique responses (not counting missing values). It would likely be useful to cast `relig` as a nominal categorical variable.

#### 7.8 tvhours

In [43]:
print(gss_cat["tvhours"].head())

0    12.0
1     NaN
2     2.0
3     4.0
4     1.0
Name: tvhours, dtype: float64


In [44]:
print(gss_cat["tvhours"].describe())

count    11337.000000
mean         2.980771
std          2.587151
min          0.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         24.000000
Name: tvhours, dtype: float64


`tvhours` is a float (specifically `float64`) ranging from 0 to 24. Depending on the analytic use case, `tvhours` could also work as an ordinal categorical variable with user-determined category levels.

### Task 8 

Make `age` a factor (a categorical variable in `pandas`). When modifying an object by changing values or the data type, it is good practice to create a new object with a meaningfully modified name rather than overwrite the original one. 

In [45]:
# first let's take out age as a pandas Series 
age = gss_cat["age"]

In [46]:
type(age)

pandas.core.series.Series

In [None]:
## your answer here



#### Task 8 Solution 

In [47]:
age_f = age.astype("category")

After changing an object, it is good practice to check that your code made the expected changes.

In [48]:
age_f.head() # all good 

0    26.0
1    48.0
2    67.0
3    39.0
4    25.0
Name: age, dtype: category
Categories (72, float64): [18.0, 19.0, 20.0, 21.0, ..., 86.0, 87.0, 88.0, 89.0]

### Task 8 Advanced 

From here, it is a bit more advanced to make `age` an ordinal categorical data type with 5 levels: 18-25, 26-44, 45-64, 65-74, 75+. To do so, we need to use `pandas.cut()` to sort our data values into bins. 

**Before looking at the solution, challenge yourself to think about the logical steps needed to solve this problem. Write them down and see how they match up to the solution provided.**

##### Your answer to the logical steps 

1. ....


2. ....


3. ....


4. ....


....

As we have not learned about this function yet, I will show you the solution and ask for you to try and figure out how it works. Modify some of the code to see what happens. Do not worry, you cannot break your computer (unless you throw it perhaps)! If you have any errors you cannot figure out, ask one of the teaching team for help during the tutorial or post on the discussion boards afterwards. 

Before making any changes to our variable, it is good practice to check if there are any missing values lurking in the shadows trying to ruin our day. 

In [49]:
age_f.isna().values.any()

## try running the above without .values.any() to see why we need it 

True

In [50]:
# next we can use sum to see how many there are 
## because isna() returns Boolean values and True counts as 1 when summed, we can use sum... understanding data types is so useful! 

age_f.isna().sum()

76

So we do indeed have some missing values (76 to be exact), which we will keep in mind. 

In [51]:
age_groups = pd.cut(
    age_f,
    bins = [-np.inf, 25, 44, 64, 74, np.inf],
    labels = ["18-25", "26-44", "45-64", "65-74", "75+"]) 

In [52]:
age_groups.head() # looking good 

0    26-44
1    45-64
2    65-74
3    26-44
4    18-25
Name: age, dtype: category
Categories (5, object): ['18-25' < '26-44' < '45-64' < '65-74' < '75+']

In [53]:
age_groups.cat.categories # celebration!

Index(['18-25', '26-44', '45-64', '65-74', '75+'], dtype='object')

In [54]:
# and look at that, pandas.cut() made it ordered for us too! 
## this is because the default behavior of the argument ordered is True
print(age_groups.cat.ordered) 

True
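
To get a feel for how the bin edges behave, here is a small sketch on made-up ages (`toy_ages` is hypothetical) that sit right on the boundaries, plus one missing value. With the default `right=True`, each bin includes its right-hand edge, so 25 lands in "18-25" and 26 in "26-44".

```python
import numpy as np
import pandas as pd

toy_ages = pd.Series([18, 25, 26, 44, 45, np.nan, 90])  # made-up ages

toy_groups = pd.cut(
    toy_ages,
    bins=[-np.inf, 25, 44, 64, 74, np.inf],
    labels=["18-25", "26-44", "45-64", "65-74", "75+"],
)

print(toy_groups)
print(toy_groups.isna().sum())  # the missing age stays missing
```

Missing values are not placed in any bin; they stay missing in the result, which is why it was worth counting them first.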


Reading documentation is a skill that you will develop over time with practice. Try to read the documentation for the cut function from pandas to see what you can learn. Ask a member of the teaching team during the tutorial or post on the discussion boards if you get stuck.

In [55]:
# look at the documentation for more info 
help(pd.cut)

Help on function cut in module pandas.core.reshape.tile:

cut(x, bins, right: 'bool' = True, labels=None, retbins: 'bool' = False, precision: 'int' = 3, include_lowest: 'bool' = False, duplicates: 'str' = 'raise', ordered: 'bool' = True)
    Bin values into discrete intervals.

    Use `cut` when you need to segment and sort data values into bins. This
    function is also useful for going from a continuous variable to a
    categorical variable. For example, `cut` could convert ages to groups of
    age ranges. Supports binning into an equal number of bins, or a
    pre-specified array of bins.

    Parameters
    ----------
    x : array-like
        The input array to be binned. Must be 1-dimensional.
    bins : int, sequence of scalars, or IntervalIndex
        The criteria to bin by.

        * int : Defines the number of equal-width bins in the range of `x`. The
          range of `x` is extended by .1% on each side to include the minimum
          and maximum values of `x`.
    

### Task 9 

How could you collapse `rincome` into a small set of categories (e.g., `"Unknown"`, `"less than $5000"`, `"$5000 to $9999"`, `"$10000 or more"`)?

Look at some summaries of the object and think about some of the challenges that you need to overcome to complete the task. Write down the steps you would need to take in plain language, regardless of whether you know how to do it in code. Understanding **what** you need to do is just as important, if not more so, than **how** you will do it (i.e., in code).

You can also look back at your solution to Task 7 above. 


In [56]:
# first let's take out rincome as a pandas Series 
rincome = gss_cat["rincome"]

print(rincome.describe())

count              21483
unique                16
top       $25000 or more
freq                7363
Name: rincome, dtype: object


In [57]:
print(rincome.head())

0     $8000 to 9999
1     $8000 to 9999
2    Not applicable
3    Not applicable
4    Not applicable
Name: rincome, dtype: object


In [58]:
print(rincome.unique())

['$8000 to 9999' 'Not applicable' '$20000 - 24999' '$25000 or more'
 '$7000 to 7999' '$10000 - 14999' 'Refused' '$15000 - 19999'
 '$3000 to 3999' '$5000 to 5999' "Don't know" '$1000 to 2999' 'Lt $1000'
 'No answer' '$6000 to 6999' '$4000 to 4999']


##### Your answer to the logical steps 

1. ....


2. ....


3. ....


4. ....


....


#### Task 9 Solution 


`rincome` is currently an object data type. It is also quite messy. The unique categories do not have consistent naming conventions (some say "to" and others use "-"), so we need to fix that. Then, we will want to group all of the non-responses into one category and the responses into 3 categories. We will then want to cast this as an ordered categorical data type.

### Advanced bonus task (Task 9) 

The advanced bonus task, should you choose to accept it, is to attempt your solution to Task 9 in code! See how far you can get! Have a look at the solutions document for a worked solution to this task. 

In [None]:
## your answer here



#### Advanced bonus task Solution (Task 9) 

In order to complete this task we can use python data structures to our advantage (we will learn more about this next week). I have used a data structure called dictionaries (which store `key:value` pairs and are denoted with `{}`) and manipulated the variable within the dataframe, rather than subsetting it. I also used the `map()` function to iterate over the column and stored our solution as a new column called `rincome_cat`. 
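
One thing worth knowing about `map()` with a dictionary: any value that is not a key in the dictionary becomes missing (`NaN`) in the result. The sketch below uses a deliberately incomplete, made-up dictionary (`toy_dict`) to show this, and counts the missing values as a quick check that every label was covered.

```python
import pandas as pd

toy_income = pd.Series(["Lt $1000", "Refused", "$9000 exactly"])  # last label is made up
toy_dict = {"Lt $1000": "Less than $5000", "Refused": "Unknown"}  # deliberately incomplete

toy_mapped = toy_income.map(toy_dict)

print(toy_mapped)
print(toy_mapped.isna().sum())  # number of labels the dictionary did not cover
```

A count of 0 after mapping is a good sign that the dictionary keys match the labels in the data exactly, including details such as "to" versus "-".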

In [59]:
## this step is not necessary as we could have just used the labels as they are in the dictionary created below
## but I have shown it just in case you tried something similar 
rincome_name = rincome.replace({'$20000 - 24999' : '$20000 to 24999', 
                                '$15000 - 19999' : '$15000 to 19999',
                                "$10000 - 14999" : '$10000 to 14999'})

# check it worked as expected
rincome_name.unique()

array(['$8000 to 9999', 'Not applicable', '$20000 to 24999',
       '$25000 or more', '$7000 to 7999', '$10000 to 14999', 'Refused',
       '$15000 to 19999', '$3000 to 3999', '$5000 to 5999', "Don't know",
       '$1000 to 2999', 'Lt $1000', 'No answer', '$6000 to 6999',
       '$4000 to 4999'], dtype=object)

In [60]:
# create a dictionary with our desired names 
map_dict = {"No answer" : "Unknown", 
            "Don't know" : "Unknown",  
            "Refused" : "Unknown", 
            "Not applicable" : "Unknown",
            "Lt $1000" : "Less than $5000", 
            "$1000 to 2999" : "Less than $5000", 
            "$3000 to 3999" : "Less than $5000", 
            "$4000 to 4999" : "Less than $5000",
            "$5000 to 5999" : "$5000 to $9999", 
            "$6000 to 6999" : "$5000 to $9999", 
            "$7000 to 7999" : "$5000 to $9999", 
            "$8000 to 9999" : "$5000 to $9999", 
            "$10000 - 14999" : "$10000 or more", 
            "$15000 - 19999" : "$10000 or more", 
            "$20000 - 24999" : "$10000 or more", 
            "$25000 or more" : "$10000 or more"}

# create new column, map each value of rincome through the dictionary created above, and change type to category 
gss_cat["rincome_cat"] = gss_cat["rincome"].map(map_dict).astype("category")

gss_cat.describe(include = 'all') # excellent, looks like we have 4 categories in rincome_cat 

Unnamed: 0,year,marital,age,race,rincome,partyid,relig,tvhours,rincome_cat
count,21483.0,21483,21407.0,21483,21483,21483,17960,11337.0,21483
unique,,6,,3,16,10,14,,4
top,,Married,,White,$25000 or more,Independent,Protestant,,$10000 or more
freq,,10117,,16395,7363,4119,10846,,10862
mean,2006.501978,,47.180081,,,,,2.980771,
std,4.451994,,17.2875,,,,,2.587151,
min,2000.0,,18.0,,,,,0.0,
25%,2002.0,,33.0,,,,,1.0,
50%,2006.0,,46.0,,,,,2.0,
75%,2010.0,,59.0,,,,,4.0,


In [61]:
gss_cat["rincome_cat"].unique() 
## looking good but we also now want it to be an ordered categorical variable 

['$5000 to $9999', 'Unknown', '$10000 or more', 'Less than $5000']
Categories (4, object): ['$10000 or more', '$5000 to $9999', 'Less than $5000', 'Unknown']

In [62]:
gss_cat["rincome_cat"] = gss_cat["rincome_cat"].cat.as_ordered() 

In [63]:
gss_cat["rincome_cat"].unique() ## happy days, it is ordered (though note as_ordered() keeps the existing alphabetical order of the categories) 

['$5000 to $9999', 'Unknown', '$10000 or more', 'Less than $5000']
Categories (4, object): ['$10000 or more' < '$5000 to $9999' < 'Less than $5000' < 'Unknown']
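
Note that the order shown above is alphabetical, so `'$10000 or more'` is treated as the lowest level. If the income bands should sort from lowest to highest, one possible approach (a sketch on made-up values, not the tutorial's own solution) is to spell out the order with `CategoricalDtype`. Placing "Unknown" first is an arbitrary choice made here for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# made-up values using the four collapsed labels from the task
toy_rincome = pd.Series(["$5000 to $9999", "Unknown", "$10000 or more", "Less than $5000"])

income_levels = CategoricalDtype(
    categories=["Unknown", "Less than $5000", "$5000 to $9999", "$10000 or more"],
    ordered=True,
)
toy_rincome_ord = toy_rincome.astype(income_levels)

print(toy_rincome_ord.cat.categories)
print(toy_rincome_ord.max())
```

With the levels given explicitly, `max()` returns the highest income band rather than the label that comes last alphabetically.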

## 3. Date and time data (and string)

### Task 10 

Create an object showing the date 140 days from now and print the output nicely formatted (`"month day, year at hour minute"`) using `strftime()`. Then create an object with the date 2 years from now and similarly print the output nicely formatted. 

<details><summary style='color:darkblue'>HINT 1: How to start breaking it down? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
`timedelta` instances allow for arithmetic, but only in fixed units such as weeks, days, hours, minutes, or seconds. To add or subtract calendar intervals whose length varies, such as a month or a year, we use `relativedelta`.

In [None]:
## your answer here



#### Task 10 Solution 

In [66]:
## 140 days from now 
days140 = dt.datetime.now() + dt.timedelta(days = +140)  
# the + before 140 is not strictly needed, but it explicitly states that I want a positive integer 
## change it to a - to see what happens 

print(days140.strftime("%B %d, %Y at %H:%M")) 

September 20, 2024 at 13:31


In [67]:
## 2 years from now 
years2 = dt.datetime.now() + relativedelta.relativedelta(years = 2)

print(years2.strftime("%B %d, %Y at %H:%M")) 

May 03, 2026 at 13:31
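
Because `dt.datetime.now()` changes every time the cell runs, it can help to check the arithmetic against a fixed start date. The sketch below is illustrative only: the `start` value is an assumption chosen for reproducibility, and the leap-day line is an extra case showing how `relativedelta` handles a date that does not exist in the target year.

```python
import datetime as dt
from dateutil import relativedelta

# fixed start date (an assumption, so the result is reproducible)
start = dt.datetime(2024, 5, 3, 13, 31)

# timedelta counts an exact number of days
plus_140_days = start + dt.timedelta(days=140)

# relativedelta moves the calendar forward and clamps to the last valid day of the month
plus_2_years = start + relativedelta.relativedelta(years=2)
leap_day_plus_1_year = dt.datetime(2024, 2, 29) + relativedelta.relativedelta(years=1)

print(plus_140_days.strftime("%B %d, %Y at %H:%M"))
print(plus_2_years.strftime("%B %d, %Y at %H:%M"))
print(leap_day_plus_1_year.strftime("%B %d, %Y"))
```

The leap-day case lands on 28 February 2025, since 29 February does not exist in that year.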


## Task 11 

This is a big one, so I have separated the task into different parts. By the end, you will have made a countdown clock to your birthday! (how cool!)

We will start by making a countdown clock until the annual Fringe Festival in Edinburgh in August. The festival starts on 2 August 2024 at 13:35.

### Step 1

Create a datetime object with the Fringe date.

In [None]:
## your answer here



#### Step 1 Solution

In [68]:
## create the datetime object 
Fringe_date = dt.datetime(year = 2024, month = 8, day = 2, hour = 13, minute = 35)

### Step 2

Create a countdown object by subtracting the current date and time from the Fringe datetime. The result will be a `timedelta`, which represents the time between two `datetime` instances.

<details><summary style='color:darkblue'>HINT: Useful functions. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

Use one of the `dt.datetime` functions to get the current date and time, rather than hard coding it 

In [None]:
## your answer here



#### Step 2 solution

In [69]:
countdown = Fringe_date - dt.datetime.now()

In [70]:
type(countdown)

datetime.timedelta

### Step 3

Write an interpolating character string (an f-string) which will take our countdown object and tell us how many days there are until Fringe! 


In [None]:
## your answer here



#### Step 3 solution

In [71]:
print(f"Countdown to Fringe 2024: {countdown}")

Countdown to Fringe 2024: 91 days, 0:03:27.312009
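
Printing a `timedelta` directly includes microseconds, which is more detail than a countdown usually needs. One possible way to tidy it up is to split the value into days, hours, and minutes using the `.days` and `.seconds` attributes. This is a sketch rather than the tutorial's solution, and `countdown_example` is a fixed value assumed here so the result is reproducible.

```python
import datetime as dt

# fixed example value (an assumption, so the output is reproducible)
countdown_example = dt.timedelta(days=91, hours=0, minutes=3, seconds=27)

# .days holds the whole days; .seconds holds the leftover seconds within the final day
days = countdown_example.days
hours, remainder = divmod(countdown_example.seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"Countdown to Fringe 2024: {days} days, {hours} hours, {minutes} minutes")
```

This breakdown assumes a positive `timedelta`. For an event in the past, `.days` is negative while `.seconds` stays non-negative, so the pieces would need extra care.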


### Step 4

We have a minimum viable product (MVP) for our task, which is great! *BUT* we can improve our countdown accuracy using time zones (i.e., aware objects)! Let's say we want a countdown specifically for someone living in California in the United States.  

Create a second datetime object for the Fringe date and set the correct time zone (i.e., Edinburgh).

In [None]:
## your answer here



#### Step 4 Solution

In [72]:
## set the timezone 
Fringe_date0 = Fringe_date.replace(tzinfo = tz.gettz("Europe/London"))

In [73]:
Fringe_date0 # confirm the object is now aware

datetime.datetime(2024, 8, 2, 13, 35, tzinfo=tzfile('/usr/share/zoneinfo/Europe/London'))

In [74]:
Fringe_date0.tzname() # great, BST as expected 

'BST'
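
It is worth being clear about what `replace(tzinfo = ...)` does: it attaches a time zone to the existing clock time without shifting it. Converting an aware datetime to another zone is a different operation, handled by the standard `astimezone()` method (not used in the solution above, shown here only for contrast). A minimal sketch of the difference, using the same Fringe date:

```python
import datetime as dt
from dateutil import tz

naive = dt.datetime(2024, 8, 2, 13, 35)

# replace() attaches a time zone; the clock time stays 13:35
in_edinburgh = naive.replace(tzinfo=tz.gettz("Europe/London"))

# astimezone() converts an aware datetime; the instant stays the same, the clock time changes
in_california = in_edinburgh.astimezone(tz.gettz("America/Los_Angeles"))

print(in_edinburgh.strftime("%H:%M %Z"))
print(in_california.strftime("%H:%M %Z"))
print(in_edinburgh == in_california)  # both describe the same instant
```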

### Step 5 

Now create a datetime object for the current time in California.

<details><summary style='color:darkblue'>HINT: Useful functions. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

* Use one of the time zone `tz` functions we learned about to create time zones not reported by your system
* Use one of the `dt.datetime` functions to get the current date and time, rather than hard coding it 

In [None]:
## your answer here



#### Step 5 solution

In [75]:
LA_tz = tz.gettz("America/Los_Angeles")
now0 = dt.datetime.now(tz = LA_tz)

In [76]:
# confirm now0 is aware 
now0

datetime.datetime(2024, 5, 3, 4, 31, 55, 178936, tzinfo=tzfile('/usr/share/zoneinfo/America/Los_Angeles'))

In [77]:
now0.tzname() # great, PDT as expected 

'PDT'

### Step 6 

Now we are ready to create a second countdown object using arithmetic with the aware datetime objects we have created: the Fringe date minus now (in California, USA).

In [None]:
## your answer here



#### Step 6 Solution

In [78]:
countdown2 = Fringe_date0 - now0

### Step 7 

Final step: write an interpolating character string (an f-string) which will take our new countdown object and tell us how many days there are until Fringe! 


In [None]:
## your answer here



#### Step 7 Solution

In [79]:
print(f"Countdown to Fringe 2024 from California, USA: {countdown2}")

Countdown to Fringe 2024 from California, USA: 91 days, 1:03:04.821064
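
Subtracting two aware datetimes compares the underlying instants, so the countdown does not depend on which time zone "now" is written in. The sketch below illustrates this under some assumptions: fixed UTC offsets (`BST`, `PDT`) stand in for `tz.gettz(...)` so that it runs without a time zone database, and "now" is a fixed value so the result is reproducible.

```python
import datetime as dt

# fixed UTC offsets stand in for tz.gettz(...) (an assumption for a self-contained sketch)
BST = dt.timezone(dt.timedelta(hours=1), "BST")
PDT = dt.timezone(dt.timedelta(hours=-7), "PDT")

fringe = dt.datetime(2024, 8, 2, 13, 35, tzinfo=BST)

# the same instant written in two time zones (a fixed "now", assumed for reproducibility)
now_california = dt.datetime(2024, 5, 3, 4, 31, tzinfo=PDT)
now_edinburgh = now_california.astimezone(BST)

# aware subtraction compares instants, so both lines give the same countdown
print(fringe - now_california)
print(fringe - now_edinburgh)

# mixing naive and aware datetimes in arithmetic raises a TypeError
try:
    fringe - dt.datetime(2024, 5, 3, 4, 31)
except TypeError as err:
    print(f"TypeError: {err}")
```

The `TypeError` is a useful safety net: Python refuses to guess which time zone a naive datetime was meant to be in.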


### Bonus 

When creating date, time, or datetime objects, you can use the `parser.parse()` function from `dateutil` which takes a string and parses (reads) the date into Python for you!

In [80]:
## for example 
example_date = parser.parse("1 January 2024 1:00AM")

print(example_date)

2024-01-01 01:00:00
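
One thing to watch with `parser.parse()` is purely numeric dates, which are ambiguous between month-first and day-first conventions. The `dayfirst` argument controls how they are read. A short sketch, using the Fringe date written numerically as an illustrative input:

```python
from dateutil import parser

# numeric dates are ambiguous: by default the parser reads the month first
month_first = parser.parse("02/08/2024 13:35")

# dayfirst=True reads the same string as 2 August
day_first = parser.parse("02/08/2024 13:35", dayfirst=True)

print(month_first)
print(day_first)
```

Writing the month as a word, as in the example above, avoids the ambiguity altogether.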


### Step 8 

Put it all together and, instead of Fringe, use your next birthday! If you want to use aware datetime objects, guess which time zone you may be in on your birthday. Be sure to update the interpolating string to reflect the new countdown event.

In [None]:
## your answer here



#### Step 8 Solution 

My birthday is on July 4th, so I have provided an example solution with that date. 

In [81]:
# create tz object 
EDI_tz = tz.gettz("Europe/London")

# create bday object 
Bday = parser.parse("July 4 2024, 00:00")

# make it aware 
Bday0 = Bday.replace(tzinfo = EDI_tz)

# create now dt object and make it aware
now1 = dt.datetime.now(tz = EDI_tz)

# create countdown object 
Bday_countdown = Bday0 - now1

# print 
print(f"Birthday Countdown: {Bday_countdown}")

Birthday Countdown: 61 days, 11:27:03.141301
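
The solution above hard codes the year, so it stops being the *next* birthday once that date has passed. One possible extension is a small helper that rolls over to the following year when needed. This is only a sketch: `next_birthday` is a hypothetical helper, not part of the tutorial, and it assumes the birthday is not 29 February (which would raise a `ValueError` in non-leap years). The fixed `now_fixed` value is also an assumption, used so the result is reproducible.

```python
import datetime as dt

def next_birthday(month, day, now):
    """Return the next month/day at midnight, in the same time zone as `now` (hypothetical helper)."""
    candidate = dt.datetime(now.year, month, day, tzinfo=now.tzinfo)
    if candidate < now:
        # this year's birthday has already passed, so use next year's
        candidate = candidate.replace(year=now.year + 1)
    return candidate

# fixed "now" (an assumption, so the output is reproducible)
now_fixed = dt.datetime(2024, 5, 3, 13, 31, tzinfo=dt.timezone.utc)

print(f"Birthday Countdown: {next_birthday(7, 4, now_fixed) - now_fixed}")
print(next_birthday(1, 15, now_fixed))  # already passed in 2024, so this rolls over to 2025
```

In practice, `now_fixed` would be replaced by `dt.datetime.now(tz = EDI_tz)` as in the solution above.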


---

## Well done! 🎉 

Well done! You have completed all of the tasks for the Python notebook for this tutorial. If you have not done so yet, now move to the R notebook.

---
*Dr. Brittany Blankinship (2024)*