# Refine the Data

In [66]:
import pandas as pd

In [67]:
df = pd.read_csv('data_tau.csv')

In [68]:
df.head()

Unnamed: 0,title,date
0,Deep Advances in Generative Modeling,6 points by gwulfs 5 hours ago | discuss
1,A Neural Network in 11 lines of Python,2 points by dekhtiar 5 hours ago | discuss
2,"Python, Machine Learning, and Language Wars",3 points by pmigdal 7 hours ago | discuss
3,Markov Chains Explained Visually,11 points by zeroviscosity 1 day ago | 1 comment
4,Dplython: Dplyr for Python,10 points by thenaturalist 1 day ago | 3 comm...


To get the age of each post in days, we will use the following rule:
- If the string contains **hours**, we count it as **1 day**
- If the string contains **day**, we pick the number preceding **day**

To apply this rule, we need to be able to pick these words and digits out of a string. For that we will use regular expressions.

## Introduction to Regular Expressions (Regex)

A regular expression is a pattern, written with literal characters and special symbols, that describes which text to select from a string.

The following sites offer an interactive playground for trying out patterns:
- [http://regexr.com](http://regexr.com/)
- [http://regex101.com/](http://regex101.com/)

In [1]:
import re

In [2]:
test_string = "Hello world, welcome to 2016."

In [112]:
# re.search scans the string and returns a match object for the first occurrence of the pattern
# Here the pattern is simply the literal text we are looking for
a = re.search('Hello world, welcome to 2016',test_string)
a.group()

'Hello world, welcome to 2016'

In [109]:
# Match the first character in the string ('.' matches any single character except a newline)
a = re.search('.',test_string)
a.group()

'H'

In [108]:
# Match the whole string ('.*' means zero or more of any character except a newline)
a = re.search('.*',test_string)
a.group()

'Hello world, welcome to 2016.'
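`re.search` returns a match object, and `.group()` is only one of its methods. The short sketch below reuses the same `test_string` to show the span of a match and the `None` that comes back when nothing matches. The `if` guard is one common way to avoid calling `.group()` on `None`; the pattern `'goodbye'` is just an illustrative value.

```python
import re

test_string = "Hello world, welcome to 2016."

match = re.search('Hello', test_string)
# span() gives the (start, end) indices of the match within the string
print(match.span())    # (0, 5)
print(match.group())   # Hello

# When the pattern is absent, re.search returns None instead of raising
missing = re.search('goodbye', test_string)
print(missing)         # None

# Guard before calling .group() to avoid an AttributeError on None
if missing is None:
    print('no match')
```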

In [11]:
a = re.search('Hello',test_string)
print(a)

<_sre.SRE_Match object; span=(0, 5), match='Hello'>


**Some basic symbols**

**`?`**

The question mark indicates zero or one occurrences of the preceding element. For example, `colou?r` matches both "color" and "colour".

**`*`**

The asterisk indicates zero or more occurrences of the preceding element. For example, `ab*c` matches "ac", "abc", "abbc", "abbbc", and so on.

**`+`**

The plus sign indicates one or more occurrences of the preceding element. For example, `ab+c` matches "abc", "abbc", "abbbc", and so on, but not "ac".
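The three symbols above can be checked directly. This small sketch uses `re.fullmatch`, which succeeds only when the entire string matches the pattern; the extra word "colouur" is added here purely to show a non-match.

```python
import re

# Each pattern is paired with a few words to test against it
cases = [
    ('colou?r', ['color', 'colour', 'colouur']),
    ('ab*c', ['ac', 'abc', 'abbbc']),
    ('ab+c', ['ac', 'abc', 'abbbc']),
]

results = {}
for pattern, words in cases:
    for word in words:
        # fullmatch returns a match object, or None if the whole word does not fit
        results[(pattern, word)] = re.fullmatch(pattern, word) is not None
        print(pattern, word, results[(pattern, word)])
```

Only `colou?r` against "colouur" and `ab+c` against "ac" print `False`.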


In [21]:
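# \w matches one word character (letter, digit or underscore); '.' then matches any one character
# The r prefix makes this a raw string, so Python leaves the backslash for the regex engine
a = re.search(r'\w.',test_string)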
print(a)

<_sre.SRE_Match object; span=(0, 2), match='He'>


In [22]:
# \w* matches a run of zero or more word characters, so the match stops at the first space
a = re.search(r'\w*',test_string)
print(a)

<_sre.SRE_Match object; span=(0, 5), match='Hello'>


### Exercises

In [89]:
string = "In 2016, we are learning Text Analytics in Data Science 101 by scraping http://datatau.com"

Write a regex to pick the number 2016 from the string above.

Write a regex to pick the URL from the string above.
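One possible approach to both exercises is sketched below; other patterns work equally well. It relies on `\d` (a digit) and `\S` (any non-whitespace character), and it assumes that the year is the first run of digits in the string and that the URL contains no spaces.

```python
import re

string = "In 2016, we are learning Text Analytics in Data Science 101 by scraping http://datatau.com"

# \d+ matches the first run of one or more digits, which here is the year
year = re.search(r'\d+', string).group()
print(year)  # 2016

# 'https?' allows http or https; \S+ then takes everything up to the next whitespace
url = re.search(r'https?://\S+', string).group()
print(url)   # http://datatau.com
```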

## Let's get the date from our string

In [25]:
df.head()

Unnamed: 0,title,date
0,Deep Advances in Generative Modeling,6 points by gwulfs 5 hours ago | discuss
1,A Neural Network in 11 lines of Python,2 points by dekhtiar 5 hours ago | discuss
2,"Python, Machine Learning, and Language Wars",3 points by pmigdal 7 hours ago | discuss
3,Markov Chains Explained Visually,11 points by zeroviscosity 1 day ago | 1 comment
4,Dplython: Dplyr for Python,10 points by thenaturalist 1 day ago | 3 comm...


In [20]:
df.tail()

Unnamed: 0,title,date
175,Finding Nearest Neighbors in SQL,7 points by jonathanbishop 50 days ago | 2 co...
176,Free Data Science Curriculum,8 points by Veerle 52 days ago | discuss
177,Deep Learning is Easy - Learn Something Harder,5 points by ebellm 48 days ago | discuss
178,"FlyElephant as a tool for calculations in C++,...",6 points by m31 51 days ago | discuss
179,Yahoo Releases the Largest-ever Machine Learni...,13 points by srinify 62 days ago | 1 comment


In [43]:
date_string = df['date'][0]

In [44]:
print(date_string)

6 points by gwulfs 5 hours ago  | discuss


In [45]:
re.search('hours',date_string)

<_sre.SRE_Match object; span=(21, 26), match='hours'>

In [46]:
date_string = df['date'][50]

In [47]:
print(date_string)

12 points by carlosgg 14 days ago  | discuss


In [49]:
# If 'hours' is not in the string, re.search returns None, so the cell shows no output
re.search('hours',date_string)

In [50]:
# Match the digits preceding the word 'day'
day_search = re.search(r'\d+ day',date_string)
day_search

<_sre.SRE_Match object; span=(22, 28), match='14 day'>

In [59]:
days_string = day_search.group(0)
days_string

'14 day'

In [60]:
days = days_string.split(' ')[0] 
days

'14'
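As an alternative to splitting the matched text, the number can be captured directly with a group. This is a minimal sketch on the same sample string, not a change to the steps above: parentheses in the pattern mark a group, and `group(1)` returns only the text inside them.

```python
import re

date_string = '12 points by carlosgg 14 days ago  | discuss'

# The parentheses capture just the digits in front of ' day'
day_search = re.search(r'(\d+) day', date_string)

# group(0) is the whole match, group(1) is the first captured group
print(day_search.group(0))  # 14 day
days = int(day_search.group(1))
print(days)                 # 14
```

Converting with `int` gives a number rather than the string `'14'`, which is easier to sort or compare later.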

In [61]:
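def return_reg_ex_days(row):
    """Return the age of a post in days, parsed from row['date']."""
    if re.search('hours', row['date']) is not None:
        # Anything posted hours ago counts as 1 day
        return 1
    day_search = re.search(r'\d+ day', row['date'])
    if day_search is None:
        # Neither 'hours' nor '<n> day' was found; leave the value missing
        return None
    # Keep only the number in front of 'day' and convert it to an int
    return int(day_search.group(0).split(' ')[0])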

In [62]:
# Now we apply this function to each row in the dataframe
df['days'] = df.apply(return_reg_ex_days,axis=1)
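The rule used here only looks for "hours" and "day". Assuming the scraped text can also contain forms such as "1 hour ago" or "20 minutes ago" (neither appears in the sample rows shown), a single pattern could cover every unit. The sketch below works on plain strings; `age_in_days` is a hypothetical helper, not part of the notebook.

```python
import re

def age_in_days(text):
    """Return the age in whole days, or None if no age is found (hypothetical helper)."""
    # Capture the number and the unit; 's?' allows both singular and plural
    match = re.search(r'(\d+) (minute|hour|day)s? ago', text)
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    # Following the rule above, anything younger than a day counts as 1 day
    return count if unit == 'day' else 1

samples = [
    '6 points by gwulfs 5 hours ago | discuss',
    '11 points by zeroviscosity 1 day ago | 1 comment',
    '13 points by srinify 62 days ago | 1 comment',
]
for sample in samples:
    print(age_in_days(sample))
```

For the three sample strings this prints 1, 1 and 62.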

In [63]:
df.head()

Unnamed: 0,title,date,days
0,Deep Advances in Generative Modeling,6 points by gwulfs 5 hours ago | discuss,1
1,A Neural Network in 11 lines of Python,2 points by dekhtiar 5 hours ago | discuss,1
2,"Python, Machine Learning, and Language Wars",3 points by pmigdal 7 hours ago | discuss,1
3,Markov Chains Explained Visually,11 points by zeroviscosity 1 day ago | 1 comment,1
4,Dplython: Dplyr for Python,10 points by thenaturalist 1 day ago | 3 comm...,1


In [64]:
df.tail()

Unnamed: 0,title,date,days
175,Parallel scikit-learn on YARN,5 points by stijntonk 39 days ago | discuss,39
176,Meetup: Free Live Webinar on Prescriptive Anal...,2 points by ann928 32 days ago | discuss,32
177,Access to VK.com (Vkontakte) API via R,2 points by dementiy 32 days ago | discuss,32
178,\tDeep Learning Tutorial by Y. LeCun and Y. B...,15 points by Anon84 50 days ago | 1 comment,50
179,Machine Learning Meets Economics,20 points by nicolaskruchten 55 days ago | di...,55


In [65]:
# Save the dataframe to a CSV file
df.to_csv('data_tau_days.csv', index=False)