# Session 4

- Handling text data
- RegEx
- Dates and times and timeseries data

## Strings

As you recall, a string in Python is just a sequence of characters. You can subset it using the usual index-based subsetting methods.
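As a quick refresher, here is a minimal sketch of index-based subsetting on a short string. The variable names are only illustrative.

```python
s = 'Hult'

first = s[0]          # first character: 'H'
last = s[-1]          # negative indices count from the end: 't'
every_other = s[::2]  # a step of 2 takes every other character: 'Hl'
middle = s[1:3]       # slice from index 1 up to (not including) 3: 'ul'
reversed_s = s[::-1]  # a step of -1 walks the string backwards: 'tluH'

print(first, last, every_other, middle, reversed_s)
```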

In [1]:
s = 'Hult'


In [2]:
# Getting the last character


In [3]:
# Getting every other character


### String methods

| Method | Description |
|---|---|
|`.capitalize()` | Capitalizes the first character |
|`.count()` | Counts the number of non-overlapping occurrences of a substring within the string |
|`.startswith()`, `.endswith()` | True if the string begins/ends with a specified string |
|`.find()` | Smallest index where the string matches, -1 if no match |
|`.isalpha()` | True if all characters are alphabetic |
|`.isdecimal()` | True if all characters are decimal numbers |
|`.isalnum()` | True if all characters are alphanumerical |
|`.lower()`, `.upper()` | Returns a copy of the string with all lower-/uppercase |
|`.strip()` | Removes leading and trailing whitespace |
|`.split()` | Returns a list of values split by a delimiter |
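The methods in the table can be tried out on a short example string. This is only a sketch; the string `course` and the other variable names are made up for illustration.

```python
course = 'Data Analytics'

capitalized = course.capitalize()   # 'Data analytics' (the rest is lowercased)
n_a = course.count('a')             # 3 (count is case sensitive)
starts = course.startswith('Data')  # True
position = course.find('Ana')       # 5
missing = course.find('xyz')        # -1, no match
only_letters = course.isalpha()     # False, because of the space
digits = '2023'.isdecimal()         # True
alnum = 'Hult2023'.isalnum()        # True
upper = course.upper()              # 'DATA ANALYTICS'
stripped = '  Hult  '.strip()       # 'Hult'
parts = course.split()              # ['Data', 'Analytics'] (splits on whitespace)
fields = 'a,b,c'.split(',')         # ['a', 'b', 'c']

print(capitalized, n_a, position, parts)
```

Note that all of these methods return a new value; strings are immutable, so `course` itself is never changed.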

In [4]:
'Hult'

'Hult'

In [5]:
'Data Analytics'

'Data Analytics'

You can join strings together using the `.join()` method. Call it on the string that should act as the separator and pass it the list of parts.
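A small sketch of `.join()` with different separators, using the same word list as the cells in this section:

```python
words = ['Hult', 'data', 'analytics', 'program']

with_spaces = ' '.join(words)   # 'Hult data analytics program'
with_dashes = '-'.join(words)   # 'Hult-data-analytics-program'
no_separator = ''.join(words)   # 'Hultdataanalyticsprogram'

# .join() only accepts strings, so other types must be converted first
numbers = ', '.join(str(n) for n in [1, 2, 3])  # '1, 2, 3'

print(with_spaces)
print(with_dashes)
print(no_separator)
print(numbers)
```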

In [7]:
['Hult', 'data', 'analytics', 'program']

['Hult', 'data', 'analytics', 'program']

In [8]:
['Hult', 'data', 'analytics', 'program']

['Hult', 'data', 'analytics', 'program']

In [9]:
['Hult', 'data', 'analytics', 'program']

['Hult', 'data', 'analytics', 'program']

## RegEx

See RegEx.ipynb


## Text Data

Very often we have to work with data in text form. Let's explore this using a dataset of Reddit self-posts (https://www.kaggle.com/datasets/mswarbrickjones/reddit-selfposts?resource=download).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv('rspct.tsv.zip', sep='\t')

It is already obvious that the data needs some cleaning to be useful.

In [10]:
#looking at one example text


It is clear that the text contains HTML tags and escaped entities (such as `&amp;`) that need to be replaced.

In [11]:
import re

# Recreate line endings


In [12]:
# replace &lt; with <, &gt; with > and &amp; with &


In [13]:
# as there are some double endings this needs to be repeated


In [14]:
# remove URLs


In [15]:
# Remove additional markdown


In [16]:
# Remove underscores


In [17]:
# Remove multiple quotes


In [70]:
# put it all together in a function
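def clean_t(t):
    """Clean one raw post text and return the cleaned string."""
    # Recreate line endings from <lb> markers and <br>, <br/>, <br /> tags
    t = t.replace('<lb>', "\n")
    t = re.sub(r'<br\s*/?>', "\n", t)
    # Replace &lt; with <, &gt; with > and &amp; with &
    t = t.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")
    # Some ampersands are double-escaped, so this needs to be repeated
    t = t.replace("&amp;", "&")
    # Remove URLs, including optional surrounding parentheses;
    # stop at whitespace so the text after a bare URL is kept
    t = re.sub(r'\(?https?://[^\s)]*\)?', "", t)
    # Remove additional markdown (asterisks)
    t = t.replace('*', '')
    # Replace runs of underscores with a single space
    t = re.sub(r'_+', ' ', t)
    # Collapse repeated double quotes into one
    t = re.sub(r'"+', '"', t)
    return t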

In [18]:
# apply it to the Dataframe
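Applying a cleaning function to every row is typically done with `Series.apply()`. The sketch below uses a shortened stand-in function and a toy DataFrame so that it runs on its own. The names `clean_text` and `posts` are hypothetical, and the column name `selftext` is an assumption; check `df.columns` for the actual name in the dataset.

```python
import re
import pandas as pd


def clean_text(t):
    """Shortened, hypothetical stand-in for the full cleaning function."""
    t = t.replace('<lb>', '\n')
    t = t.replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')
    t = re.sub(r'_+', ' ', t)
    return t


# Toy data standing in for the Reddit posts (column name is an assumption)
posts = pd.DataFrame({
    'selftext': ['Hello<lb>world', 'R&amp;D is &gt; fun', 'snake_case__text']
})

# .apply() calls the function once per value and returns a new Series
posts['clean'] = posts['selftext'].apply(clean_text)

print(posts['clean'].tolist())
```

Storing the result in a new column keeps the raw text available for comparison.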


## Dates and times

In many cases date and time information is a crucial part of a dataset.

Python's standard library provides a `datetime` class in the `datetime` module to handle this.

In [74]:
from datetime import datetime as dt

It can provide the current date and time:

You can also do math with datetime objects.
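A minimal sketch of both ideas, using fixed example dates so the results are reproducible. Subtracting two datetimes gives a `timedelta`, and a `timedelta` can be added to a datetime.

```python
from datetime import datetime as dt, timedelta

now = dt.now()  # current local date and time (changes on every run)

# Fixed, made-up dates for the arithmetic
start = dt(2023, 1, 15, 9, 30)
end = dt(2023, 3, 1, 17, 0)

duration = end - start                  # a timedelta object
one_week_later = start + timedelta(days=7)

print(now)
print(duration.days)        # 45
print(one_week_later)       # 2023-01-22 09:30:00
```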

### Converting to datetime

When reading in datetime information from a file it is usually represented as a string. This string needs to be converted into a datetime object.

In [81]:
mv = pd.read_csv('Month_Value_1.csv')

The conversion to datetime has worked, but sadly the month and day seem to be switched.

You can also specify an explicit format for the conversion.

The formatting options can be found here:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
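As a sketch of an explicit format, the example below parses made-up strings written as day.month.year. The sample values are invented for illustration and are not taken from `Month_Value_1.csv`.

```python
import pandas as pd

# Made-up date strings in day.month.year order
raw = pd.Series(['01.02.2015', '01.03.2015', '01.04.2015'])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw, format='%d.%m.%Y')

print(parsed.dt.month.tolist())  # [2, 3, 4]
print(parsed.dt.day.tolist())    # [1, 1, 1]
```

Without the `format` argument, pandas has to guess the order of day and month, which is how the two can end up switched.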

### Getting date components

The datetime library makes it easy to extract date and time components.

Using the `.dt` accessor, this also works on a datetime column of a DataFrame.
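A short sketch of both variants: attributes on a single datetime object, and the `.dt` accessor on a DataFrame column. The DataFrame `events` and its column `when` are hypothetical.

```python
from datetime import datetime as dt
import pandas as pd

# Components of a single datetime object
moment = dt(2023, 3, 14, 15, 9, 26)
print(moment.year, moment.month, moment.day, moment.hour)  # 2023 3 14 15
print(moment.weekday())  # 1 (Monday is 0, so this is a Tuesday)

# The same components for a whole column via the .dt accessor
events = pd.DataFrame({'when': pd.to_datetime(['2023-03-14', '2023-12-25'])})
events['year'] = events['when'].dt.year
events['month'] = events['when'].dt.month
events['weekday'] = events['when'].dt.day_name()

print(events)
```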

### Timeseries data

A common example of time series data is the stock market. For easy access we will use the Yahoo Finance API via the `yfinance` package.

In [115]:
pip install yfinance


In [1]:
import yfinance as yf

### Resampling

Often the frequency of a time series needs to be changed to conduct the required analysis.

There are three types of resampling:
- Downsampling (e.g. daily to monthly)
- Upsampling (e.g. monthly to daily)
- No change (e.g. from every first Monday of the month to every last Friday of the month)
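The first two cases can be sketched on a small synthetic series. The data below is made up (a simple daily counter), and `'MS'` (month start) is just one possible target frequency.

```python
import pandas as pd

# Synthetic daily data for January and February 2023 (values 0, 1, 2, ...)
days = pd.date_range('2023-01-01', '2023-02-28', freq='D')
daily = pd.Series(range(len(days)), index=days, dtype='float64')

# Downsampling: daily -> monthly, which needs an aggregation function
monthly_mean = daily.resample('MS').mean()
print(monthly_mean.tolist())  # [15.0, 44.5]

# Upsampling: monthly -> daily, which needs a fill strategy for the new rows
upsampled = monthly_mean.resample('D').ffill()
print(len(upsampled))         # 32 (2023-01-01 through 2023-02-01)
```

When downsampling, several values collapse into one, so a function such as `.mean()`, `.sum()` or `.last()` decides how. When upsampling, new rows appear, and `.ffill()` here simply carries the last known value forward.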

In [19]:
# When downsampling we need to provide an aggregation function

# Average monthly values


The `.resample()` method is very powerful. Refer to https://towardsdatascience.com/using-the-pandas-resample-function-a231144194c4 for a good guide.