# CME538 - Introduction to Data Science
## Lecture 4.1 - Working with Text and Datetimes

### Lecture Structure
1. [Demo 1](#section1)
2. [Demo 2](#section2)
3. [Demo 3](#section3)
4. [Demo 4](#section4)

## Setup Notebook

In [None]:
# Import 3rd party libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

<a id='section1'></a>
## 1. Demo 1 - Ontario Region Data

In [None]:
# Import Region-Population CSV.
region_population = pd.read_csv('region_population.csv')
region_population.head()

In [None]:
# Import Region-Province CSV.
region_province = pd.read_csv('region_province.csv')
region_province.head()

Write a function that maps each **Region** name to a consistent string representation.

In [None]:
def canonicalization_region(region):
    return (
        region.lower()
        .replace(' ', '')
        .replace('-', '')
        .replace('county', '')
        .replace('regional', '')
        .replace('region', '')
        .replace('of', '')
        .replace('municipality', '')
    )

Add a new cleaned column for **Region**.

In [None]:
region_population['clean_region'] = region_population['region'].map(canonicalization_region)
region_population.head()

In [None]:
region_province['clean_region'] = region_province['region'].map(canonicalization_region)
region_province.head()

In [None]:
region_population.merge(region_province[['province', 'clean_region']], on='clean_region', how='right')
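A right merge keeps every row of `region_province`, so any region whose cleaned name still fails to match shows up with missing population values. One way to check for this, sketched below on small made-up tables (the region names and numbers are hypothetical, not taken from the CSV files), is the `indicator=True` option of `.merge()`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for the two cleaned tables.
population = pd.DataFrame({
    'clean_region': ['york', 'peel', 'halton'],
    'population': [100, 200, 300],
})
province = pd.DataFrame({
    'clean_region': ['york', 'peel', 'durham'],
    'province': ['Ontario', 'Ontario', 'Ontario'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = population.merge(province, on='clean_region', how='right', indicator=True)

# Rows that did not find a partner in the other table.
unmatched = merged[merged['_merge'] != 'both']

print(merged)
print(unmatched['clean_region'].tolist())
```

Any names listed in `unmatched` point to spellings that the canonicalization function does not yet handle.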

<a id='section2'></a>
## 2. Demo 2 - Log Data

Extract the Day, Month, Year, Hour, Minute, Second and Time Zone from log data using Python string methods.

Read the text file.

In [None]:
with open('log.txt', 'r') as f:
    logs = f.read().splitlines()  # avoids a trailing empty string if the file ends with a newline
print(logs)

In [None]:
print(logs[0], '\n')
print(logs[1], '\n')
print(logs[2], '\n')

`logs` is now a list of strings where each string is a log.

We could try simply slicing the strings by position. It looks like the Year occupies characters 27:31. Let's try extracting the Year.

In [None]:
# Log 1
logs[0][27:31]

It worked! Let's try it for the second log.

In [None]:
# Log 2
logs[1][27:31]

We can see that this doesn't generalize. Let's try using Python string methods.

First, let's grab the section of the log we're interested in using the `.split()` method.

In [None]:
logs[0]

In [None]:
text = logs[0].split('[')[1].split(']')[0]
text

We can see that this works for all logs.

In [None]:
for log in logs:
    print(log.split('[')[1].split(']')[0])
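Splitting works where fixed positions fail because the text before the `[` can vary in length from one log to the next. The sketch below illustrates this with two made-up log lines (hypothetical, not taken from `log.txt`) whose leading IP addresses have different lengths.

```python
# Two hypothetical log lines with prefixes of different lengths.
log_a = '10.0.0.1 - - [26/Jan/2014:10:47:58 -0800] "GET / HTTP/1.1" 200 512'
log_b = '192.168.100.25 - - [2/Feb/2005:17:23:06 -0800] "GET / HTTP/1.1" 200 512'

# Find where the year sits in the first log and reuse that position.
start = log_a.index('2014')
year_slice = slice(start, start + 4)

print(log_a[year_slice])  # the year of the first log
print(log_b[year_slice])  # not a year: the prefix of the second log is longer

# Splitting on the brackets does not depend on the prefix length.
stamp_a = log_a.split('[')[1].split(']')[0]
stamp_b = log_b.split('[')[1].split(']')[0]
print(stamp_a)
print(stamp_b)
```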

Now we can use `.split()` again to separate the day, the month, and the remainder.

In [None]:
day, month, remainder = text.split('/')
print('day: {}, month: {}, remainder: {}'.format(day, month, remainder))

In [None]:
year, hour, minute, remainder = remainder.split(':')
print('year: {}, hour: {}, minute: {}, remainder: {}'.format(year, hour, minute, remainder))

In [None]:
seconds, time_zone = remainder.split(' ')
print('second: {}, time_zone: {}'.format(seconds, time_zone))

Now, let's try extracting these fields from each log and saving them to a DataFrame.

In [None]:
def log_parser(log):
    text = log.split('[')[1].split(']')[0]
    day, month, remainder = text.split('/')
    year, hour, minute, remainder = remainder.split(':')
    seconds, time_zone = remainder.split(' ')
    return {'day': day, 'month': month, 'year': year, 
            'hour': hour, 'minute': minute, 'seconds': seconds, 
            'time_zone': time_zone}

In [None]:
[log_parser(log) for log in logs]

In [None]:
data = pd.DataFrame([log_parser(log) for log in logs])
data.head()

Try the same thing using regular expressions.

In [None]:
import re

pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.findall(pattern, logs[0])[0]

print(year, month, day, hour, minute, second, time_zone)
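The pattern above can be applied to every log in one pass. The following is a minimal sketch rather than the lecture's solution: it uses named groups so each match converts directly to a dictionary, and it skips lines that do not match (for example, an empty string left over from splitting the file). The `sample_logs` list and the `parse_log` helper are hypothetical stand-ins for `logs` and `log_parser`.

```python
import re
import pandas as pd

# Hypothetical log lines, including an empty one that should be skipped.
sample_logs = [
    '10.0.0.1 - - [26/Jan/2014:10:47:58 -0800] "GET / HTTP/1.1" 200 512',
    '192.168.100.25 - - [2/Feb/2005:17:23:06 -0800] "GET / HTTP/1.1" 200 512',
    '',
]

# Same structure as the pattern above, with a name for each group.
# '.+?' is non-greedy so the match stops at the first closing bracket.
named_pattern = re.compile(
    r'\[(?P<day>\d+)/(?P<month>\w+)/(?P<year>\d+)'
    r':(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+) (?P<time_zone>.+?)\]'
)

def parse_log(log):
    """Return a dict of timestamp fields, or None if the line does not match."""
    match = named_pattern.search(log)
    return match.groupdict() if match else None

parsed = [row for row in map(parse_log, sample_logs) if row is not None]
parsed_frame = pd.DataFrame(parsed)
print(parsed_frame)
```

Returning `None` for non-matching lines keeps one malformed log from stopping the whole loop, at the cost of silently dropping it, so it is worth comparing `len(parsed)` with the number of input lines.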

<a id='section3'></a>
## 3. Demo 3 - DateTime Index

First, let's import Python's built-in `datetime` module.

In [None]:
from datetime import datetime

To start, let's generate some variables to hold datetime information.

In [None]:
year = 2020
month = 11
day = 12
hour = 15
minute = 10
second = 32
microsecond = 2304

Now, create a datetime object. Hold shift + tab to see what arguments `datetime()` takes.

In [None]:
date_time = datetime(year=year, month=month, day=day, 
                     hour=hour, minute=minute, second=second, 
                     microsecond=microsecond)
date_time

The `datetime` class has many useful attributes and methods. Type `date_time.` and press tab.

In [None]:
date_time.year

In [None]:
date_time.hour

In [None]:
date_time.weekday()

`.weekday()` returns the day of the week as an integer, where Monday is 0 and Sunday is 6. Weekday = 3 is Thursday!

We can also format the datetime object as a string. Visit this [website](https://www.w3schools.com/python/python_datetime.asp) for a reference of all the legal format codes.
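Format codes can be combined into a single format string, and the same codes work in reverse with `datetime.strptime()`, which parses a string back into a datetime object. A short sketch using only numeric codes, which do not depend on the locale (the name `example_dt` is chosen for this example so it does not overwrite `date_time`):

```python
from datetime import datetime

example_dt = datetime(2020, 11, 12, 15, 10, 32)
fmt = '%Y-%m-%d %H:%M:%S'

# datetime -> string
text = example_dt.strftime(fmt)

# string -> datetime, using the same format codes
round_trip = datetime.strptime(text, fmt)

print(text)
print(round_trip == example_dt)
```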

**Weekday, short version**

In [None]:
date_time.strftime('%a')

**Weekday, full version**

In [None]:
date_time.strftime('%A')

**Timezone**

In [None]:
date_time.strftime('%Z')

Hmm, this is interesting. Why is there no time zone information?

Remember the `datetime.datetime()` class has a time zone argument which is set to `tzinfo=None` by default.

To make a datetime object have a time zone, you can use the pytz library.

In [None]:
import pytz

First, let's use `pytz` to create a time zone object.

In [None]:
time_zone = pytz.timezone('Canada/Eastern')

Next, we can apply the time zone to our naive datetime.

In [None]:
date_time_aware = time_zone.localize(date_time)
date_time_aware
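If `pytz` is not available, the standard library can also produce an aware datetime. The sketch below attaches a fixed UTC-05:00 offset with `datetime.timezone`; the label `'EST'` and the offset are assumptions chosen for this example, and unlike `'Canada/Eastern'` a fixed offset does not follow daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# A naive datetime has no time zone, so '%Z' formats to an empty string.
naive = datetime(2020, 11, 12, 15, 10, 32, 2304)
print(repr(naive.strftime('%Z')))

# Attach a fixed offset of UTC-05:00, labelled 'EST' (assumed for illustration).
fixed_est = timezone(timedelta(hours=-5), 'EST')
aware = naive.replace(tzinfo=fixed_est)

print(aware.strftime('%Z'), aware.strftime('%z'))
print(aware.isoformat())
```

Using `.replace(tzinfo=...)` is safe with a fixed-offset `timezone` object; with `pytz` time zones, `.localize()` as shown above is the appropriate call.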

<a id='section4'></a>
## 4. Demo 4 - Datetimes in Pandas

Typically, we deal with time series as a datetime index when working with a Pandas DataFrame. Pandas has a lot of functions and methods to work with time series that you can check out [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html).

### DatetimeIndex
Let's start by creating a dummy DataFrame. We can use the `pd.date_range()` function to create a Pandas `DatetimeIndex`.

In [None]:
idx = pd.date_range(start='12/31/2020', periods=10, freq='Y')
idx

In this case, we've created a `DatetimeIndex` starting at `12/31/2020` and lasting for 10 periods at a yearly (year-end) frequency. Setting `freq='D'` instead gives one entry per day.

In [None]:
idx = pd.date_range(start='12/31/2020', periods=10, freq='D')
idx

Another option is to convert existing datetime information into a `DatetimeIndex`.

In [None]:
idx = pd.to_datetime(['November 21, 2020','4/3/19',
                      '10-Feb-2012', None, 10.34], format='mixed')
idx

Question: Why did `10.34` successfully convert to a datetime?

Ok, let's create a DataFrame.

In [None]:
idx = pd.date_range(start='12/11/2020', periods=1000, freq='D')
idx

In [None]:
import numpy as np
data = pd.DataFrame(index=idx, 
                    data=np.random.rand(1000, 2), 
                    columns=['Var1', 'Var2'])
data.head()

Now we can use Pandas index operations.

In [None]:
data.index.min()

In [None]:
data.index.max()
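A `DatetimeIndex` also supports selecting rows by a partial date string and exposes datetime attributes for every row at once. A brief sketch on a frame built like the one above (the names `demo_idx` and `demo` and the fixed random seed are choices made for this example):

```python
import numpy as np
import pandas as pd

demo_idx = pd.date_range(start='12/11/2020', periods=1000, freq='D')
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 2)), index=demo_idx, columns=['Var1', 'Var2'])

# Partial string indexing: every row from January 2021.
january = demo.loc['2021-01']

# Slicing with date strings includes both endpoints.
first_week = demo.loc['2020-12-11':'2020-12-17']

# Datetime attributes are available for the whole index.
weekdays = demo.index.day_name()

print(len(january), len(first_week), weekdays[0])
```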

### Time Resampling

Next, we'll explore Pandas `.resample()` method.

Let's import the Uber ride data from Lecture 3.1.

In [None]:
# Import data
uber_data = pd.read_csv('uber-raw-data-jun14.csv')
uber_data.head()

In [None]:
uber_data.info()

In [None]:
# Set 'Date/Time' column as index
uber_data = uber_data.set_index('Date/Time')
uber_data.head()

In [None]:
# Convert index to DatetimeIndex
uber_data.index = pd.DatetimeIndex(uber_data.index)
uber_data.head()

We know that this DataFrame contains ride data from June 2014.

Let's say we want to generate a plot showing the number of rides per day in June. We can do this using `.resample()`.

When calling `.resample()`, you need to specify a `rule` parameter and then call an aggregation function such as `count`, `sum`, or `mean`.

The `rule` parameter describes the frequency at which to apply the aggregation function (daily, monthly, etc.).

There are many rules as seen [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) under `Offset aliases`.
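As a small illustration of the rule-plus-aggregation idea, the sketch below resamples five made-up ride timestamps (hypothetical values, not rows from the Uber file) to a daily frequency and counts the rows in each bin.

```python
import pandas as pd

# Hypothetical ride records indexed by pickup time.
rides = pd.DataFrame(
    {'Base': ['B1', 'B1', 'B2', 'B1', 'B2']},
    index=pd.to_datetime([
        '2014-06-01 00:05', '2014-06-01 13:40', '2014-06-02 08:15',
        '2014-06-04 09:00', '2014-06-04 23:59',
    ]),
)

# rule='D' groups rows into calendar days; .size() counts the rows per day.
daily = rides.resample('D').size()
print(daily)
```

Note that June 3 has no rides in this toy data but still appears in the result with a count of 0, because `.resample()` produces every bin between the first and last timestamp.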



In [None]:
# Daily count
counts = uber_data.index.value_counts().resample('D').sum()
counts

Now, let's try plotting it.

In [None]:
title = 'Uber Rides Per Day, June 2014'
counts.plot.bar(figsize=(15, 5), title=title)
plt.show()

Question: What could be causing the cyclical pattern?

Now, let's make it nicer.

In [None]:
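def line_format(label):
    """
    Convert a Timestamp into a bar-plot tick label: the first letter of
    the weekday, plus the day of the month on Mondays.
    """
    day = label.day_name()[:3]
    # The original compared against 'Jan', which a weekday name never equals.
    if day == 'Mon':
        return day[0] + f'\n\n{label.day}'
    else:
        return day[0]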

counts = uber_data.index.value_counts().resample('D').sum()

plt.title('Daily Uber Rides in New York City (June 2014)', fontsize=18)
ax = counts.plot.bar(figsize=(15, 5), rot=0)
ax.set_xticklabels(map(line_format, counts.index))
ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)
ax.set_xlabel('Date-Time', fontsize=18)
ax.set_ylabel('Rides', fontsize=18)  # counts are raw ride totals, not thousands
plt.savefig('1.png')  # save next to the notebook instead of a machine-specific path
plt.show()