# Regular Expressions
### <p style="color:Tomato">Learn about regular expressions by analyzing AskReddit data<p/>
We'll be working with **a data set containing the top 1,000 questions users posted to AskReddit in 2015**. Reddit user P_S_Laplace created the data set, which has five columns that appear in the following order: `Title`, `Score`, `Time`, `Gold`, `NumComs`.
#### <p style="color:Gray">askreddit_2015.csv<p/>

### <p style="color:Gray">To parse strings<p/><hr>
- "Jan 17, 2012"
- "9/22/2005"
- "Spring 2007"
- "New Year's Eve 1999"<br/>
<br/>

We can handle a problem like this with regular expressions.

#### <p style="color:Gray"># Regular expressions<p/><hr>
A regular expression (regex) is a sequence of characters that describes a search pattern. We can use regular expressions to search for and extract data. <br/>
<br/>


####  <p style="color:Gray"> The same format <p/> <hr/>
#### <p style="color:Gray">1. An ordinary sequence of characters that we specify<p/>

In [1]:
# An ordinary sequence of characters that we specify
strings = ["data science", "big data", "metadata"]
regex = "data"

#### <p style="color:Gray">2. A number of special characters we can use with regular expressions<p/>
> To change the way a pattern is interpreted.<br/>
In Python, we use the re module to work with regular expressions. The module's documentation provides a list of these special characters.<br/>

<br/>


#####  <p style="color:Gray">2-1. The special character "."<p/>
The special character "." matches any single character in its position (by default, anything except a newline).

In [2]:
strings = ["bat", "robotics", "megabyte"]
regex = "b.t"
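The cell above only defines the data and the pattern. A minimal sketch of how the pattern could be applied with `re.search()` follows; the extra string "boot" and the names `matches` and `non_matches` are assumptions added for illustration.

```python
import re

strings = ["bat", "robotics", "megabyte", "boot"]  # "boot" is an added counterexample
regex = "b.t"

# Keep every string that contains "b", then any single character, then "t".
matches = [s for s in strings if re.search(regex, s) is not None]
non_matches = [s for s in strings if re.search(regex, s) is None]

print(matches)      # "bat", "bot" in robotics, "byt" in megabyte
print(non_matches)  # "boot" has two characters between "b" and "t"
```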

#####  <p style="color:Gray"> 2-2. The Caret symbol "^" <p/>
To match the beginning of a string.<br/>
"^a" : matches all strings that start with "a"
#####  <p style="color:Gray"> 2-3. The dollar sign <p/> 
To match the end of a string.<br/>
"a$" : matches all strings that end with "a"

In [3]:
strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = "^b.tter"
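As a quick check of the anchored pattern, the sketch below runs it against each string. The names `starts_ok` and `bad_ok` are hypothetical and only used here.

```python
import re

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = "^b.tter"

# "^" pins the match to the start of the string, so "bitter" at the end
# of bad_string does not count.
starts_ok = [re.search(regex, s) is not None for s in strings]
bad_ok = re.search(regex, bad_string) is not None

print(starts_ok)
print(bad_ok)
```

Without the caret, `"b.tter"` would also match `bad_string`, because "bitter" appears later in the sentence.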

In [4]:
import csv

# csv.reader returns an iterator over the rows, not a list, so indexing it
# raises "'_csv.reader' object is not subscriptable". list() collects the rows.
# encoding='UTF8' avoids "'cp949' codec can't decode byte 0xe2 ..." errors
# on systems whose default encoding is cp949.
with open('askreddit_2015.csv', 'r', encoding='UTF8') as f:
    posts = list(csv.reader(f))

posts_with_header = posts[0]  # the first row holds the column names
print(posts_with_header)
posts = posts[1:]             # keep only the data rows

['Title', 'Score', 'Time', 'Gold', 'NumComs']


In [5]:
print(posts[:10])

[['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195'], ["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479'], ['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055'], ["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201'], ['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325'], ['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389'], ["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520'], ['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '143882

In [7]:
for post in posts[:10]:
    print(post)

['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195']
["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479']
['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055']
["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']
['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325']
['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389']
["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520']
['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '1438822288.0',

In [8]:
import re
if re.search("needle", "haystack") is not None:
    print("We found it!")
else:
    print("Not a match")

### <p style="color:Tomato">re.search()<p/>

> You may have noticed that many of the posts in our AskReddit data set are directed towards particular groups of people, **using phrases** like "Soldiers of Reddit". These types of posts are common, and always follow a similar format. We can use regular expressions to count how many of them are in the top 1,000.
<br/>


In [9]:
of_reddit_count = 0
for row in posts:
    if re.search("of Reddit", row[0]) is not None:
        of_reddit_count += 1

In [10]:
of_reddit_count

76

#####  <p style="color:Gray"> 2-4. square brackets <p/> 
To indicate that any character within them can fill the space.<br/>
"[bcr]at" : would match the substrings "bat", "cat", "rat"

In [11]:
# "of Reddit" and "of reddit"
of_reddit_count = 0
for row in posts:
    if re.search("of [Rr]eddit", row[0]):
        of_reddit_count += 1

In [12]:
of_reddit_count

102

#####  <p style="color:Gray"> 2-5. Escape special characters "\"<p/> 
In regular expressions, escaping a character means indicating that you don't want the character to do anything special, and that the interpreter should treat it just like any other character. To escape a character, put a backslash (`\`) in front of it.<br/>
<br/>
For example, "." matches any character, while `\.` matches only a literal period.

#####  <p style="color:Gray"> 2-6. `\.$` <p/>
To match all of the strings that end with a period.
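The difference between the escaped and unescaped period is easy to miss, so here is a small sketch. The `titles` list is made up for illustration and is not taken from the AskReddit data.

```python
import re

# Hypothetical sample titles, not from the data set.
titles = ["Is this a question?", "This is a statement.", "Trailing dots..."]

# r"\.$" requires a literal period as the last character.
escaped = [t for t in titles if re.search(r"\.$", t)]

# r".$" accepts any last character, so every non-empty title matches.
unescaped = [t for t in titles if re.search(r".$", t)]

print(escaped)
print(unescaped)
```

The raw-string prefix `r` keeps Python from interpreting the backslash itself before the regex engine sees it.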

In [13]:
print(posts_with_header)

['Title', 'Score', 'Time', 'Gold', 'NumComs']


In [14]:
for row in posts[:10]:
    print(row)

['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195']
["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479']
['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055']
["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']
['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325']
['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389']
["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520']
['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '1438822288.0',

In [15]:
serious_count = 0
for row in posts:
    if re.search(r"\[Serious\]", row[0]) is not None:
        serious_count += 1

print(serious_count) 

69


In [16]:
serious_count = 0
for row in posts:
    if re.search(r"\[[Ss]erious\]", row[0]) is not None:
        serious_count += 1

print(serious_count)

77


In [17]:
serious_count = 0
for row in posts:
    if re.search(r"[\[\(][Ss]erious[\]\)]", row[0]) is not None:
        serious_count += 1

print(serious_count)

80


#####  <p style="color:Gray"> 2-7. "|" <p/>
To combine regular expressions. <br/>
"cat|dog" : matches "catfish" and "hotdog", because each of these strings contains either "cat" or "dog"

In [18]:
serious_start_count = 0
for row in posts:
    if re.search(r"^[\[\(][Ss]", row[0]):
        serious_start_count += 1
print(serious_start_count)

69


In [19]:
serious_end_count = 0
for row in posts:
    if re.search(r"[\]\)]$", row[0]):
        serious_end_count += 1
print(serious_end_count)

12


In [20]:
serious_count_final = 0
for row in posts:
    if re.search(r"^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$", row[0]):
        serious_count_final += 1
print(serious_count_final)

80
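To see how "|" behaves on its own and together with the anchors, here is a short sketch on made-up strings. The names `words`, `either`, and `anchored` are hypothetical.

```python
import re

words = ["catfish", "hotdog", "bird", "concat dogs"]  # illustrative strings

# "cat|dog" matches if either alternative appears anywhere in the string.
either = [w for w in words if re.search("cat|dog", w)]

# "|" has low precedence: "^cat|dog$" reads as (^cat) OR (dog$),
# which is the same shape as the start-or-end [Serious] pattern above.
anchored = [w for w in words if re.search("^cat|dog$", w)]

print(either)
print(anchored)
```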


In [21]:
serious_start_count = 0
serious_end_count = 0
serious_count_final = 0
for row in posts:
    if re.search(r"^[\[\(][Ss]erious[\]\)]", row[0]):
        serious_start_count += 1
    if re.search(r"[\[\(][Ss]erious[\]\)]$", row[0]):
        serious_end_count += 1
    if re.search(r"^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$", row[0]):
        serious_count_final += 1
print(serious_start_count)
print(serious_end_count)
print(serious_count_final)

69
11
80


### <p style="color:Tomato">re.sub()<p/>
The re module provides a sub() function that takes the following parameters (in order):
* pattern: The regex to match
* repl: The string that should replace the substring matches
* string: The string containing the pattern we want to search<br/>


If it finds no match for the pattern, the re.sub() function simply returns the original string.<br/>

Let's use re.sub() to convert all serious tags to the format "[Serious]"



In [22]:
re.sub("yo", "hello", "yo world")

'hello world'
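Before applying this to the whole data set, it may help to see the replacement on a few strings. The sketch below uses invented titles (not rows from the CSV) and shows both a replaced tag and the no-match case.

```python
import re

# Hypothetical titles used only to illustrate the behavior.
titles = [
    "(serious) What is your goal?",
    "What is your goal? [serious]",
    "No tag here",
]

pattern = r"[\[\(][Ss]erious[\]\)]"

# Every bracketed or parenthesized variant becomes "[Serious]";
# a title without a tag comes back unchanged.
normalized = [re.sub(pattern, "[Serious]", t) for t in titles]

for title in normalized:
    print(title)
```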

In [23]:
for row in posts[:20]:
    print(row[0])

What's your internet "white whale", something you've been searching for years to find with no luck?
What's your favorite video that is 10 seconds or less?
What are some interesting tests you can take to find out about yourself?
PhD's of Reddit. What is a dumbed down summary of your thesis?
What is cool to be good at, yet uncool to be REALLY good at?
[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?
Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?
What is a good subreddit to binge read the All Time Top Posts of?
What would the person who named Walkie Talkies have named other items?
People who grew up in a different socioeconomic class as your significant others, what are the notable differences you've noticed and how does it affect your relationship (if at all)?
What are the best YouTube channels to binge watch ?
What website c

In [24]:
for row in posts:
    row[0] = re.sub(r"[\[\(][Ss]erious[\]\)]", "[Serious]", row[0])

In [25]:
for row in posts[:20]:
    print(row[0])

What's your internet "white whale", something you've been searching for years to find with no luck?
What's your favorite video that is 10 seconds or less?
What are some interesting tests you can take to find out about yourself?
PhD's of Reddit. What is a dumbed down summary of your thesis?
What is cool to be good at, yet uncool to be REALLY good at?
[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?
Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?
What is a good subreddit to binge read the All Time Top Posts of?
What would the person who named Walkie Talkies have named other items?
People who grew up in a different socioeconomic class as your significant others, what are the notable differences you've noticed and how does it affect your relationship (if at all)?
What are the best YouTube channels to binge watch ?
What website c

In [26]:
strings = ['War of 1812', 'There are 5280 feet to a mile', 'Happy New Year 2016!']

In [27]:
year_strings = []
for item in strings:
    if re.search("[1-2][0-9][0-9][0-9]", item) is not None:
        year_strings.append(item)
print(year_strings)

['War of 1812', 'Happy New Year 2016!']


#####  <p style="color:Gray"> 2-8. "{}" <p/>
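To indicate that a pattern should repeat.<br/>

"[0-9][0-9][0-9][0-9]" can be written more compactly as "[0-9]{4}".

re.search() returns None when there is no match, so the cells here compare its result with `is not None` to make that check explicit.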

In [28]:
year_strings = []
for item in strings:
    if re.search("[1-2][0-9]{3}", item) is not None:
        year_strings.append(item)
print(year_strings)

['War of 1812', 'Happy New Year 2016!']


### <p style="color:Tomato">re.findall()<p/>
The re module also provides a findall() function, which returns a list of all substrings matching the regex.

In [29]:
re.findall("[a-z]", "abc123")

['a', 'b', 'c']

In [30]:
print(year_strings)

['War of 1812', 'Happy New Year 2016!']


In [31]:
import re
# re.findall() expects a single string, so passing the list directly fails:
# years = re.findall("[1-2][0-9]{3}", year_strings)
# TypeError: expected string or bytes-like object

In [32]:
type(year_strings[0])

str

In [33]:
type(year_strings)

list

In [34]:
years_string = ''.join(year_strings)
type(years_string)

str

In [35]:
years = re.findall("[1-2][0-9]{3}", years_string)
print(years)

['1812', '2016']
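Joining the strings with `''` works here, but it glues neighboring strings together, so digits at the end of one string and the start of the next could in principle form a false match. An alternative sketch, assuming the same `year_strings` list, calls `re.findall()` once per string:

```python
import re

year_strings = ['War of 1812', 'Happy New Year 2016!']

years = []
for item in year_strings:
    # findall returns a list per string; extend() flattens the results.
    years.extend(re.findall("[1-2][0-9]{3}", item))

print(years)
```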


If there is no match, re.search() returns None.

In [36]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.


# Dates in Python
### <p style="color:Tomato">Use dates and times in Python to analyze AskReddit data<p/>

##### <p style="color:Gray">askreddit_2015.csv<p/>

#### <p style="color:Gray">A Unix timestamp<p/>
A Unix timestamp is a floating point value with no explicit mention of day, month or year.<br/>
This value represents the number of seconds that have passed since the "epoch", which is midnight UTC on January 1, 1970.<br/>
A timestamp of 0.0  : the epoch <br/>
A timestamp of 60.0 : one minute after the epoch.<br/>
We can represent any date after 1970 this way.<br/>
<br/>


#### <p style="color:Gray"># time.time()<p/>
To retrieve the current Unix timestamp

In [37]:
import time
current_time = time.time()
print(current_time)

1508570029.1339297


#### <p style="color:Gray"># time.gmtime()<p/>
Convert a timestamp to a more human-readable form. This function takes a timestamp as an argument, and returns an instance of <span style="color:Tomato">the struct_time</span> class.<br/>
struct_time instances have attributes, such as tm_year and tm_hour, that expose the individual components of that time.

In [38]:
current_time = time.time()
current_struct_time = time.gmtime(current_time)
current_year = current_struct_time.tm_year
print(current_time)
print(current_struct_time)
print(current_year)

1508570029.1635053
time.struct_time(tm_year=2017, tm_mon=10, tm_mday=21, tm_hour=7, tm_min=13, tm_sec=49, tm_wday=5, tm_yday=294, tm_isdst=0)
2017
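The same function can confirm what the epoch is. This short sketch passes the timestamps 0 and 60 to `time.gmtime()`; the variable names are chosen only for this example.

```python
import time

epoch = time.gmtime(0)        # timestamp 0.0 is the epoch itself
one_minute = time.gmtime(60)  # 60 seconds later

print(epoch.tm_year, epoch.tm_mon, epoch.tm_mday, epoch.tm_hour, epoch.tm_min)
print(one_minute.tm_min)
```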


In [39]:
current_time = time.time()
current_struct_time = time.gmtime(current_time)
current_hour = current_struct_time.tm_hour
print("current_time")
print(current_time)
print("current_struct_time")
print(current_struct_time)
print("current_hour")
print(current_hour)

current_time
1508570029.192583
current_struct_time
time.struct_time(tm_year=2017, tm_mon=10, tm_mday=21, tm_hour=7, tm_min=13, tm_sec=49, tm_wday=5, tm_yday=294, tm_isdst=0)
current_hour
7


#### <p style="color:Gray"># datetime.datetime.utcnow()<p/>

In [40]:
import datetime

In [41]:
nye_2017 = datetime.datetime(year=2017, month=12, 
                             day=31, hour=12, 
                             minute=59, second=59)
print(nye_2017)

2017-12-31 12:59:59


In [42]:
current_datetime = datetime.datetime.utcnow()
current_year = current_datetime.year
current_month = current_datetime.month
print(current_datetime)

2017-10-21 07:13:49.298865
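One caveat: `utcnow()` returns a naive datetime (no time zone attached), and Python 3.12 and later mark it as deprecated. A possible alternative, shown here as a sketch rather than a required change, is to ask for a time zone aware value:

```python
import datetime

# now() with an explicit UTC tzinfo returns an "aware" datetime.
aware_now = datetime.datetime.now(datetime.timezone.utc)

print(aware_now)
print(aware_now.tzinfo)
```

The `.year` and `.month` attributes work the same way on aware datetimes.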


#### <p style="color:Gray">#datetime.timedelta() <p/>
To perform arithmetic on dates

In [43]:
today = datetime.datetime.now()
print(today)

diff = datetime.timedelta(weeks = 3, days = 2)
print(diff)

future = today + diff
print(future)

past = today - diff
print(past)

2017-10-21 16:13:49.329447
23 days, 0:00:00
2017-11-13 16:13:49.329447
2017-09-28 16:13:49.329447


In [44]:
kirks_birthday = datetime.datetime(year=2233, month=3, day=22)
diff = datetime.timedelta(weeks=15)
before_kirk = kirks_birthday - diff
print(before_kirk)

2232-12-07 00:00:00
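Arithmetic also works in the other direction: subtracting one datetime from another yields a timedelta. A minimal sketch with two arbitrarily chosen dates:

```python
import datetime

start = datetime.datetime(year=2015, month=1, day=1)
end = datetime.datetime(year=2015, month=12, day=31)

# The difference between two datetimes is a timedelta.
elapsed = end - start

print(elapsed.days)
print(elapsed.total_seconds())
```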


#### <p style="color:Gray">#datetime.strftime() <p/>
To specify how we'd like the string output to be formatted.<br/>
> The datetime.datetime.strftime() method takes a format string as its input. A format string contains special indicators, usually preceded by percent characters ("%"), that indicate where certain values should go.

<br/>

<br/>

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [45]:
march3 = datetime.datetime(year = 2010, month = 3, day = 3)
pretty_march3 = march3.strftime("%b %d, %Y")
print(pretty_march3)

Mar 03, 2010


In [46]:
mystery_date = datetime.datetime(
                                 2015, 12, 31, 0, 0)

In [47]:
mystery_date_formatted_string = mystery_date.strftime(
    "%I:%M%p on %A %B %d, %Y")

In [48]:
print(mystery_date_formatted_string)

12:00AM on Thursday December 31, 2015
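For comparison, here is a sketch that uses only numeric indicators, whose output does not depend on the machine's language settings (names such as `%A` and `%B` do). The date is arbitrary.

```python
import datetime

moment = datetime.datetime(year=2015, month=12, day=31, hour=18, minute=5)

# %Y = four-digit year, %m = month, %d = day, %H = 24-hour clock, %M = minutes
iso_like = moment.strftime("%Y-%m-%d %H:%M")

# %y = two-digit year
short_date = moment.strftime("%d/%m/%y")

print(iso_like)
print(short_date)
```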


#### <p style="color:Gray">#datetime.strptime() <p/>
Converts a string to a datetime instance. It takes two arguments:
1. The date string (e.g. "Mar 03, 2010")
2. The format string (e.g. "%b %d, %Y")

This is useful if we have a date in a string format and need to convert it to a datetime instance.

In [49]:
march3 = datetime.datetime.strptime("Mar 03, 2010", "%b %d, %Y")
print(march3)

2010-03-03 00:00:00


In [50]:
mystery_date = datetime.datetime.strptime(mystery_date_formatted_string, "%I:%M%p on %A %B %d, %Y")
print(mystery_date)

2015-12-31 00:00:00
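The format string has to describe the input exactly. The sketch below parses "9/22/2005", one of the date strings listed at the start of this notebook, and shows that a string in a different shape raises `ValueError`. The variable names are illustrative.

```python
import datetime

# %m/%d/%Y matches month/day/four-digit-year separated by slashes.
parsed = datetime.datetime.strptime("9/22/2005", "%m/%d/%Y")
print(parsed)

# A string that does not fit the format cannot be parsed.
try:
    datetime.datetime.strptime("Spring 2007", "%m/%d/%Y")
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(mismatch_failed)
```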


#### <p style="color:Gray">Convert Unix timestamps into datetime objects<p/>
The values in the Time column are formatted as Unix timestamps.

In [51]:
posts_with_header

['Title', 'Score', 'Time', 'Gold', 'NumComs']

In [52]:
posts[:5]

[['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?',
  '11510',
  '1433213314.0',
  '1',
  '26195'],
 ["What's your favorite video that is 10 seconds or less?",
  '8656',
  '1434205517.0',
  '4',
  '8479'],
 ['What are some interesting tests you can take to find out about yourself?',
  '8480',
  '1443409636.0',
  '1',
  '4055'],
 ["PhD's of Reddit. What is a dumbed down summary of your thesis?",
  '7927',
  '1440188623.0',
  '0',
  '13201'],
 ['What is cool to be good at, yet uncool to be REALLY good at?',
  '7711',
  '1440082910.0',
  '0',
  '20325']]

In [53]:
for row in posts:
    row[2] = datetime.datetime.fromtimestamp(float(row[2]))
print(posts[:5])    

[['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', datetime.datetime(2015, 6, 2, 11, 48, 34), '1', '26195'], ["What's your favorite video that is 10 seconds or less?", '8656', datetime.datetime(2015, 6, 13, 23, 25, 17), '4', '8479'], ['What are some interesting tests you can take to find out about yourself?', '8480', datetime.datetime(2015, 9, 28, 12, 7, 16), '1', '4055'], ["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', datetime.datetime(2015, 8, 22, 5, 23, 43), '0', '13201'], ['What is cool to be good at, yet uncool to be REALLY good at?', '7711', datetime.datetime(2015, 8, 21, 0, 1, 50), '0', '20325']]
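Note that `fromtimestamp()` without a time zone converts to the local time of the machine running the code, so the hours above (and, near month boundaries, the month) depend on where the notebook runs. A sketch of a time zone independent conversion, using the first timestamp from the data:

```python
import datetime

raw = "1433213314.0"  # Time value of the first post

# Passing tz makes the result the same on every machine.
utc_time = datetime.datetime.fromtimestamp(float(raw), tz=datetime.timezone.utc)

print(utc_time)
```

This is an optional variation; the cells that follow keep the local-time values computed above.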


In [54]:
march_count = 0
for row in posts:
    if row[2].month == 3:
        march_count += 1

In [55]:
print(march_count)

58


In [57]:
feb_count = 0
aug_count = 0
for row in posts:
    if row[2].month == 2:
        feb_count += 1
    elif row[2].month == 8:
        aug_count += 1
print(feb_count)
print(aug_count)

45
93
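Counting one month at a time gets repetitive. As a sketch of a more general approach, `collections.Counter` can tally every month in one pass. The example below uses only the first five Time values from the data and converts them in UTC, so its counts are illustrative and not comparable to the totals above.

```python
import datetime
from collections import Counter

# Time values of the first five posts in the data set.
sample_timestamps = [
    "1433213314.0",
    "1434205517.0",
    "1443409636.0",
    "1440188623.0",
    "1440082910.0",
]

# Map each timestamp to its month number, then count occurrences.
month_counts = Counter(
    datetime.datetime.fromtimestamp(float(ts), tz=datetime.timezone.utc).month
    for ts in sample_timestamps
)

for month, count in sorted(month_counts.items()):
    print(month, count)
```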
