# TOC

[Week 1](#Week-1) <br>
[Week 2](#Week-2) <br>
[Week 3](#Week-3) <br>
[Week 4](#Week-4) <br>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
path = r'D:/Coursera/Applied Data Science with Python/Applied Text Mining/Notebooks/'

## Notes

### **Regex**

* `.` : matches any single character
* `^` : start of a string
* `$`: end of a string
* `[]` : matches one of the set of characters in the []
* `[a-z]` : matches any character a,b,c,....z
* `[^abc]` : matches any character other than a,b,c
* `a|b` : matches either a or b
* `()` : scoping for operators, equivalent to finding only the substring in the () and returning it; to be able to return the whole string instead of the substring in the () we need to use (?:), which means that it matches 0 ir 1 occurence of the substring inside (), For eg. 23 Oct 2002 or Oct 23, 2002 can all be matched with `(?:\d{,2} )?(?:Jan|Feb....|Dec)\w*(?:\d{1,2}, )?\d{2,4}`
* `\` : escape character for special characters (\t,\n,\b)
* `\b` : matches word boundary
* `\d` : any digit, equivalent to [0-9]
* `\D` : any non-digit, equivalent to [^0-9]
* `\s` : any whitespace, equivalent to [\t,\n,\r,\f,\v]
* `\S` : any non-whitespace, equivalent to [^ \t,\n,\r,\f,\v]
* `\w` : Alphanumeric character, equivalent to [a-zA-Z0-9_]
* `\W` : Non-alphanumeric character, equivalent to [^ a-zA-Z0-9_]
* `*` : Matches 0 or more occurrences 
* `+` : Matches 1 or more occurrences
* `?` : Matches 0 or 1 occurrence
* `{n}` : Exactly n repetitions, n>=0
* `{n,}` : At least n repetitions
* `{,n}` : At most n repetitions
* `{m,n}` : At least m and at most n repetitions

### Finding duplicate indices

In [425]:
date7_temp.index[date7_temp.index.duplicated()]

Int64Index([], dtype='int64')

### Archiving a file

To tar a big file

In [14]:
# locate the file; ls works on linux, replace it with dir for windows
!ls -l Amazon_Unlocked_Mobile.csv

lrwxrwxrwx 1 nobody nogroup 47 Nov 29 13:57 Amazon_Unlocked_Mobile.csv -> /home/jovyan/work-ro/Amazon_Unlocked_Mobile.csv


In [13]:
# tar the file; this is a twice archived file and to open it we will need 7-zip or winrar on windows
# don't worry about the warning below about removing leading / 
!tar zcvf Amazon_Unlocked_Mobile.csv.tar.gz /home/jovyan/work-ro/Amazon_Unlocked_Mobile.csv

tar: Removing leading `/' from member names
/home/jovyan/work-ro/Amazon_Unlocked_Mobile.csv


In [15]:
# check the archived file size
!ls -l Amazon_Unlocked_Mobile.csv.tar.gz

-rw-r--r-- 1 jovyan users 33428412 Dec 13 08:35 Amazon_Unlocked_Mobile.csv.tar.gz


In [None]:
# create a zipfile
zf = zipfile.ZipFile(path+'amazon.zip', 'w')
# write the file in the zip and compress it 
zf.write('Amazon_Unlocked_Mobile.csv', compress_type=zipfile.ZIP_DEFLATED)
zf.close()

# Week 1

##  Working with text

In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [2]:
text2 = text1.split(' ') # Return a list of the words in text2, separating by ' '.

len(text2)

14

In [3]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

<br>
List comprehension allows us to find specific words:

In [4]:
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [6]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [7]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [8]:
len(set(text4))

5

In [9]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [10]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [11]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

In [27]:
[w.title() for w in text4]

['To', 'Be', 'Or', 'Not', 'To', 'Be']

In [29]:
text5 = 'ouagadougou'
text5.split('ou')
#  to get individual characters of a string 
list(text5),[c for c in text5]

(['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u'],
 ['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u'])

## Processing free-text(Regex)

In [12]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hastags:

In [32]:
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"
[x for x in tweet.split(' ') if x.startswith('#')]

['#regex', '#pandas', '#python']

In [13]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [14]:
[w for w in text6 if w.startswith('@')]

['@']

In [15]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example `'@[A-Za-z0-9_]+'` will return all words that: 
* start with `'@'` and are followed by at least one: 
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`) 
* number (`'0-9'`)
* or underscore (`'_'`)
* `+` indicates one or more characters inside the [] brackets

In [23]:
import re # import re - a module that provides support for regular expressions
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

In [33]:
[w for w in text8 if re.search('@\w+',w)] # the same regex as above but more concise

['@UN', '@UN_Women']

In [36]:
#  find all vowels
text5,re.findall('[aeiou]',text5)

('ouagadougou', ['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u'])

In [41]:
# find all consonants
text5,re.findall('[^aeiou]',text5)

('ouagadougou', ['g', 'd', 'g'])

**Some date variations of 23rd October 2002** <br>
* 23-10-2002
* 23/10/2002
* 23/10/02
* 10/23/2002
* 23 Oct 2002
* 23 October 2002
* Oct 23, 2002
* October 23, 2002

In [296]:
date_str = ' 23-10-2002  23/10/2002 23/10/02 10/23/2002 23 Oct 2002 23 October 2002 Oct 23, 2002 October 23, 2002'
re.findall('\d{1,2}\W+\d{1,2}\W+\d{2,4}',date_str)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [163]:
# for finding the dates with month specified
list(map(str.strip,re.findall('(?:\d{,2} )?[a-zA-Z]+ (?:\d{,2}, )?\d{2,4}',date_str)))

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [164]:
re.findall('(?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{,2}, )?\d{2,4}',date_str)

['23 Oct 2002', '23 October 2002', ' Oct 23, 2002', ' October 23, 2002']

## Text in Pandas

In [165]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

Unnamed: 0,text
0,Monday: The doctor's appointment is at 2:45pm.
1,Tuesday: The dentist's appointment is at 11:30...
2,"Wednesday: At 7:00pm, there is a basketball game!"
3,Thursday: Be back home by 11:15 pm at the latest.
4,"Friday: Take the train at 08:10 am, arrive at ..."


In [166]:
# find the number of characters for each string in df['text']
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [167]:
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [168]:
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [169]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [171]:
# find all occurances of the digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [172]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [173]:
# replace weekdays with '???'
df['text'].str.replace(r'\w+day\b', '???')

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [226]:
# replace weekdays with 3 letter abbrevations
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3])

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [282]:
# replace weekdays with 3 letter abbrevations; only works with one group defined. here we are replacing appointment
df['text'].str.replace(r'(\w+ment\b)', lambda x: x.groups()[0][::2])

0            Monday: The doctor's apitet is at 2:45pm.
1        Tuesday: The dentist's apitet is at 11:30 am.
2    Wednesday: At 7:00pm, there is a basketball game!
3    Thursday: Be back home by 11:15 pm at the latest.
4    Friday: Take the train at 08:10 am, arrive at ...
Name: text, dtype: object

In [284]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

Unnamed: 0,0,1
0,2,45
1,11,30
2,7,0
3,11,15
4,8,10


In [286]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,3
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,0,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,8,10,am
4,1,09:00am,9,0,am


In [229]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

Unnamed: 0_level_0,Unnamed: 1_level_0,time,hour,minute,period
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,0,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,8,10,am
4,1,09:00am,9,0,am


## Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant infromation from the data. 

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.

*This function should return a Series of length 500 and dtype int.*

In [None]:
import pandas as pd,numpy as np, re, datetime
from functools import reduce
path = 'D:/Coursera/Applied Data Science with Python/Applied Text Mining/Notebooks/'
doc = []
with open(path+'dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

**For the older version of Pandas use the code below. This might be needed for you to complete the assignment as the grader considers the older version of Pandas**

In [None]:
def date_sorter():
    
    # Your code here
    date1 = (df.str.extractall('((\d{1,2}/)(\d{1,2}/)(\d{2,4}))'))
    date1[1] = [i.zfill(2) for i in date1[1].str.replace('/','')]
    date1[2] = [i.zfill(2) for i in date1[2].str.replace('/','')]
    date1[3] = ['19'+str(i) if len(i)<3 else i for i in date1[3]]
    date1['final_data'] = date1[2]+'/'+date1[1]+'/'+date1[3]
    date1 = date1.reset_index().drop(['match',0],axis=1).set_index('level_0').rename(columns={1:0,2:1,3:2})
    date1 = date1[date1[2]!='308']
    date1['final_date'] = pd.to_datetime(date1.final_data,format='%d/%m/%Y')
    
    date2 = (df.str.extractall('((?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*..(?:\d{,2}..)\d{2,4})'))
    col1,col2,col3 = [],[],[]
    for i,j in enumerate(date2[0].str.strip().str.split(' ')):
        k=0
        if(len(j)==3):
            col1.append(j[k])
            col2.append(j[k+1])
            col3.append(j[k+2])
        else:
            col1.append(1)
            col2.append(j[k])
            col3.append(j[k+1])

    date2_temp = pd.DataFrame(data={'col1':col1,'col2':col2,'col3':col3})

    (date2_temp['col1'].replace(['Feb','October.', 'Jan','Mar.','Sep.', 'Nov','Dec','Oct','September.'],
                                                    ['February','October', 'January','March','September', 'November',
                                                     'December','October','September'],inplace=True))

    (date2_temp['col2'].replace(['July,','May,','January,','February,','Mar,','June,','March,', 'Apr,','Decemeber',
                                'Janaury'],
                               ['July','May','January','February','March','June','March', 'April','December',
                               'January'],inplace=True))

    rows_to_match = ['April', 'May', 'February',
           'October', 'January', 'July', 'December', 'September', 'March',
           'August', 'November', 'June',]

    date2_temp.loc[date2_temp.col1.isin(rows_to_match), 
                   ['col1','col2']] = date2_temp.loc[date2_temp.col1.isin(rows_to_match), ['col2','col1']].values

    date2_temp.col1 = [str(i).replace(',','') for i in date2_temp.col1]
    date2_temp['col1'] = [i.zfill(2) for i in date2_temp['col1'].str.strip()]

    date2_temp['final_data'] = date2_temp['col1']+' '+date2_temp['col2']+' '+date2_temp['col3']
    date2_temp.set_index(date2.index.levels[0],inplace=True)
    date2_temp.rename(columns={'col1':0,'col2':1,'col3':2},inplace=True)
    date2_temp['final_date'] = pd.to_datetime(date2_temp.final_data,infer_datetime_format=True)
    # date2_temp.head()
    
    date3 = (df.str.extractall('((?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]{,5}\W{,2}\d{2,4})'))
    date3_temp = pd.DataFrame(list(date3[0].str.split(' ')))

    date3_temp[0] = [i.replace('.','').replace(',','') for i in date3_temp[0] ]
    date3_temp[1] = [i.replace('.','').replace(',','') for i in date3_temp[1] ]

    date3_temp.loc[date3_temp[2].isnull(),[0,1,2]] = date3_temp.loc[date3_temp[2].isnull(),[2, 0,1]].values

    date3_temp.loc[date3_temp[0]=='',0] = '01'
    date3_temp.loc[date3_temp[0].isnull(),0] = '01'

    date3_temp[1].unique()

    (date3_temp[1].replace(['Feb','Jan', 'Sep','Oct', 'Nov','Mar', 'Aug', 'Dec', 'Jun','Jul', 'Apr','Janaury'],
                        ['February', 'January','September','October', 'November','March','August',
                         'December','June','July','April','January'],inplace=True))
    date3_temp[2] = ['19'+str(i) if len(i)<3 else i for i in date3_temp[2]]


    date3_temp['final_data'] = date3_temp[0]+' '+date3_temp[1]+' '+date3_temp[2]
    date3_temp.set_index(date3.index.levels[0],inplace=True)
    date3_temp['final_date'] = pd.to_datetime(date3_temp.final_data,infer_datetime_format=True)

    # date3_temp.head()
    
    date4 = (df.str.extractall('((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d{,2}[sthrdn]*.) \d{4})'))
    date4_temp = pd.DataFrame(list(date4[0].str.split(' ')))

    date4_temp[0] = [i.replace('.','').replace(',','') for i in date4_temp[0] ]
    date4_temp[1] = [i.replace('.','').replace(',','') for i in date4_temp[1] ]

    date4_temp.loc[:,[0,1]] = date4_temp.loc[:,[1,0]].values

    date4_temp[1].replace(['Feb', 'Jan', 'Nov', 'Dec', 'Oct'],
                        ['February', 'January', 'November','December','October'],inplace=True)
    date4_temp['final_data'] = date4_temp[0]+' '+date4_temp[1]+' '+date4_temp[2]
    date4_temp.set_index(date4.index.levels[0],inplace=True)
    date4_temp['final_date'] = pd.to_datetime(date4_temp.final_data,infer_datetime_format=True)
    # date4_temp
    
    date5 = (df.str.extractall('(\d{,2}/\d{4})'))
    temp_index= set(date5.index.levels[0]).difference(date1.index)
    date5 = date5[date5.index.levels[0].isin(list(temp_index))]
    date5_temp = pd.DataFrame(list(date5[0].str.split('/')))

    date5_temp[2] = '01'

    date5_temp[0] = [i.zfill(2) for i in date5_temp[0].str.strip()]

    date5_temp['final_data'] = date5_temp[2]+' '+date5_temp[0]+' '+date5_temp[1]
    date5_temp.set_index(np.array([*temp_index]),inplace=True)
    date5_temp['final_date'] = pd.to_datetime(date5_temp.final_data,format='%d %m %Y')
    # display_max(date5_temp)
    
    date6 = (df.str.extractall('([12]\d{3})'))
    missing_indices = set(date6.index.levels[0]).difference(set(date1.index).union(date2_temp.index).
                                                            union(date3_temp.index).
                                                            union(date4_temp.index).union(date5_temp.index))
    date6_temp = date6.copy()
    date6_temp.reset_index(inplace=True)
    date6_temp.set_index(date6_temp.level_0,inplace=True)
    date6_temp.drop('level_0',axis=1,inplace=True)
    date6_temp=date6_temp.iloc[date6_temp.index.get_indexer([*missing_indices])]
    date6_temp[1] = '01'
    date6_temp[2] = '01'
    date6_temp['final_data'] = date6_temp[1]+' '+date6_temp[2]+' '+date6_temp[0]
    date6_temp.drop(labels='match',axis=1,inplace=True)
    date6_temp['final_date'] = pd.to_datetime(date6_temp.final_data,format='%d %m %Y')
    # display_max(date6_temp)
    
    date7 = (df.str.extractall('((\d{1,2}-)(\d{1,2}-)(\d{1,2}))'))
    date7_temp = date7.copy()
    date7_temp[1] = [i.zfill(2) for i in date7_temp[1].str.replace('-','').str.strip()]
    date7_temp[2] = [i for i in date7_temp[2].str.replace('-','')]
    date7_temp[3] = ['19'+i for i in date7_temp[3].str.strip()]
    date7_temp['final_data'] = date7_temp[2]+' '+date7_temp[1]+' '+date7_temp[3]
    date7_temp.set_index(date7.index.levels[0],inplace=True)
    date7_temp = date7_temp.drop([0],axis=1).rename(columns={1:0,2:1,3:2})
    date7_temp['final_date'] = pd.to_datetime(date7_temp.final_data,format='%d %m %Y')
    # date7_temp
    
    idx_to_match = date4_temp.index
    date3_temp = date3_temp[~date3_temp.isin(idx_to_match)]
    temp_index3 = set(date3_temp.index).difference(set(date1.index).union(date2_temp.index))
    temp_index4 = set(date4_temp.index).difference(set(date1.index).union(date2_temp.index).union(date3_temp.index))
    # temp_index5 = set(date5_temp.index).difference(set(date1.index).union(date2_temp.index).union(date3_temp.index)
    #                                               .union(date4_temp.index))

    final_date3_df = date3_temp.iloc[np.sort(date3_temp.index.get_indexer([*temp_index3]))]
    final_date4_df = date4_temp.iloc[np.sort(date4_temp.index.get_indexer([*temp_index4]))]
    # final_date5_df = date5_temp.iloc[np.sort(date5_temp.index.get_indexer([*temp_index5]))]

    final_df_list = date1,date2_temp,final_date3_df,final_date4_df,date5_temp,date6_temp,date7_temp
    final_df = reduce(lambda x,y: pd.concat([x,y]),final_df_list)
    final_df.sort_index(inplace=True)
#     display_max(final_df)
    
    return np.argsort(final_df['final_date'])# Your answer here

**For the latest version of Pandas use the code below**

In [None]:
def date_sorter():
    
    # Your code here
    date1 = (df.str.extractall('((\d{1,2}/)(\d{1,2}/)(\d{2,4}))'))
    date1[1] = [i.zfill(2) for i in date1[1].str.replace('/','')]
    date1[2] = [i.zfill(2) for i in date1[2].str.replace('/','')]
    date1[3] = ['19'+str(i) if len(i)<3 else i for i in date1[3]]
    date1['final_data'] = date1[2]+'/'+date1[1]+'/'+date1[3]
    date1 = date1.reset_index().drop(['match',0],axis=1).set_index('level_0').rename({1:0,2:1,3:2},axis=1)
    date1 = date1[date1[2]!='308']
    date1['final_date'] = pd.to_datetime(date1.final_data,format='%d/%m/%Y')
    
    date2 = (df.str.extractall('((?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*..(?:\d{,2}..)\d{2,4})'))
    col1,col2,col3 = [],[],[]
    for i,j in enumerate(date2[0].str.strip().str.split(' ')):
        k=0
        if(len(j)==3):
            col1.append(j[k])
            col2.append(j[k+1])
            col3.append(j[k+2])
        else:
            col1.append(1)
            col2.append(j[k])
            col3.append(j[k+1])

    date2_temp = pd.DataFrame(data={'col1':col1,'col2':col2,'col3':col3})

    (date2_temp['col1'].replace(['Feb','October.', 'Jan','Mar.','Sep.', 'Nov','Dec','Oct','September.'],
                                                    ['February','October', 'January','March','September', 'November',
                                                     'December','October','September'],inplace=True))

    (date2_temp['col2'].replace(['July,','May,','January,','February,','Mar,','June,','March,', 'Apr,','Decemeber',
                                'Janaury'],
                               ['July','May','January','February','March','June','March', 'April','December',
                               'January'],inplace=True))

    rows_to_match = ['April', 'May', 'February',
           'October', 'January', 'July', 'December', 'September', 'March',
           'August', 'November', 'June',]

    date2_temp.loc[date2_temp.col1.isin(rows_to_match), 
                   ['col1','col2']] = date2_temp.loc[date2_temp.col1.isin(rows_to_match), ['col2','col1']].values

    date2_temp.col1 = [str(i).replace(',','') for i in date2_temp.col1]
    date2_temp['col1'] = [i.zfill(2) for i in date2_temp['col1'].str.strip()]

    date2_temp['final_data'] = date2_temp['col1']+' '+date2_temp['col2']+' '+date2_temp['col3']
    date2_temp.set_index(date2.index.levels[0],inplace=True)
    date2_temp.rename({'col1':0,'col2':1,'col3':2},axis=1,inplace=True)
    date2_temp['final_date'] = pd.to_datetime(date2_temp.final_data,infer_datetime_format=True)
    # date2_temp.head()
    
    date3 = (df.str.extractall('((?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]{,5}\W{,2}\d{2,4})'))
    date3_temp = pd.DataFrame(date3[0].str.split(' ').to_list())

    date3_temp[0] = [i.replace('.','').replace(',','') for i in date3_temp[0] ]
    date3_temp[1] = [i.replace('.','').replace(',','') for i in date3_temp[1] ]

    date3_temp.loc[date3_temp[2].isna(),[0,1,2]] = date3_temp.loc[date3_temp[2].isna(),[2, 0,1]].values

    date3_temp.loc[date3_temp[0]=='',0] = '01'
    date3_temp.loc[date3_temp[0].isna(),0] = '01'

    date3_temp[1].unique()

    (date3_temp[1].replace(['Feb','Jan', 'Sep','Oct', 'Nov','Mar', 'Aug', 'Dec', 'Jun','Jul', 'Apr','Janaury'],
                        ['February', 'January','September','October', 'November','March','August',
                         'December','June','July','April','January'],inplace=True))
    date3_temp[2] = ['19'+str(i) if len(i)<3 else i for i in date3_temp[2]]


    date3_temp['final_data'] = date3_temp[0]+' '+date3_temp[1]+' '+date3_temp[2]
    date3_temp.set_index(date3.index.levels[0],inplace=True)
    date3_temp['final_date'] = pd.to_datetime(date3_temp.final_data,infer_datetime_format=True)

    # date3_temp.head()
    
    date4 = (df.str.extractall('((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d{,2}[sthrdn]*.) \d{4})'))
    date4_temp = pd.DataFrame(date4[0].str.split(' ').to_list())

    date4_temp[0] = [i.replace('.','').replace(',','') for i in date4_temp[0] ]
    date4_temp[1] = [i.replace('.','').replace(',','') for i in date4_temp[1] ]

    date4_temp.loc[:,[0,1]] = date4_temp.loc[:,[1,0]].values

    date4_temp[1].replace(['Feb', 'Jan', 'Nov', 'Dec', 'Oct'],
                        ['February', 'January', 'November','December','October'],inplace=True)
    date4_temp['final_data'] = date4_temp[0]+' '+date4_temp[1]+' '+date4_temp[2]
    date4_temp.set_index(date4.index.levels[0],inplace=True)
    date4_temp['final_date'] = pd.to_datetime(date4_temp.final_data,infer_datetime_format=True)
    # date4_temp
    
    date5 = (df.str.extractall('(\d{,2}/\d{4})'))
    temp_index= set(date5.index.levels[0]).difference(date1.index)
    date5 = date5[date5.index.levels[0].isin(list(temp_index))]
    date5_temp = pd.DataFrame(date5[0].str.split('/').to_list())

    date5_temp[2] = '01'

    date5_temp[0] = [i.zfill(2) for i in date5_temp[0].str.strip()]

    date5_temp['final_data'] = date5_temp[2]+' '+date5_temp[0]+' '+date5_temp[1]
    date5_temp.set_index(np.array([*temp_index]),inplace=True)
    date5_temp['final_date'] = pd.to_datetime(date5_temp.final_data,format='%d %m %Y')
    
    date6 = (df.str.extractall('([12]\d{3})'))
    missing_indices = set(date6.index.levels[0]).difference(set(date1.index).union(date2_temp.index).
                                                            union(date3_temp.index).
                                                            union(date4_temp.index).union(date5_temp.index))
    date6_temp = date6.copy()
    date6_temp.reset_index(inplace=True)
    date6_temp.set_index(date6_temp.level_0,inplace=True)
    date6_temp.drop('level_0',axis=1,inplace=True)
    date6_temp=date6_temp.iloc[date6_temp.index.get_indexer([*missing_indices])]
    date6_temp[1] = '01'
    date6_temp[2] = '01'
    date6_temp['final_data'] = date6_temp[1]+' '+date6_temp[2]+' '+date6_temp[0]
    date6_temp.drop(labels='match',axis=1,inplace=True)
    date6_temp['final_date'] = pd.to_datetime(date6_temp.final_data,format='%d %m %Y')
    # display_max(date6_temp)
    
    date7 = (df.str.extractall('((\d{1,2}-)(\d{1,2}-)(\d{1,2}))'))
    date7_temp = date7.copy()
    date7_temp[1] = [i.zfill(2) for i in date7_temp[1].str.replace('-','').str.strip()]
    date7_temp[2] = [i for i in date7_temp[2].str.replace('-','')]
    date7_temp[3] = ['19'+i for i in date7_temp[3].str.strip()]
    date7_temp['final_data'] = date7_temp[2]+' '+date7_temp[1]+' '+date7_temp[3]
    date7_temp.set_index(date7.index.levels[0],inplace=True)
    date7_temp = date7_temp.drop([0],axis=1).rename({1:0,2:1,3:2},axis=1)
    date7_temp['final_date'] = pd.to_datetime(date7_temp.final_data,format='%d %m %Y')
    # date7_temp
    
    idx_to_match = date4_temp.index
    date3_temp = date3_temp[~date3_temp.isin(idx_to_match)]
    temp_index3 = set(date3_temp.index).difference(set(date1.index).union(date2_temp.index))
    temp_index4 = set(date4_temp.index).difference(set(date1.index).union(date2_temp.index).union(date3_temp.index))
    # temp_index5 = set(date5_temp.index).difference(set(date1.index).union(date2_temp.index).union(date3_temp.index)
    #                                               .union(date4_temp.index))

    final_date3_df = date3_temp.iloc[np.sort(date3_temp.index.get_indexer([*temp_index3]))]
    final_date4_df = date4_temp.iloc[np.sort(date4_temp.index.get_indexer([*temp_index4]))]
    # final_date5_df = date5_temp.iloc[np.sort(date5_temp.index.get_indexer([*temp_index5]))]

    final_df_list = date1,date2_temp,final_date3_df,final_date4_df,date5_temp,date6_temp,date7_temp
    final_df = reduce(lambda x,y: pd.concat([x,y]),final_df_list)
    final_df.sort_index(inplace=True)
    # display_max(final_df)

    return np.argsort(final_df['final_date'])# Your answer here

In [None]:
display_max(date_sorter())

**Individual program components**

In [41]:
def display_max(df):
    with pd.option_context('display.max_rows',1000,'display.max_columns',1000):
        display(df)

In [460]:
date1 = (df.str.extractall('((\d{1,2}/)(\d{1,2}/)(\d{2,4}))'))

date1[1] = [i.zfill(2) for i in date1[1].str.replace('/','')]
date1[2] = [i.zfill(2) for i in date1[2].str.replace('/','')]
date1[3] = ['19'+str(i) if len(i)<3 else i for i in date1[3]]

date1['final_data'] = date1[2]+'/'+date1[1]+'/'+date1[3]
date1 = date1.reset_index().drop(['match',0],axis=1).set_index('level_0').rename({1:0,2:1,3:2},axis=1)
date1 = date1[date1[2]!='308']
date1['final_date'] = pd.to_datetime(date1.final_data,format='%d/%m/%Y')
# date1.head()

In [462]:
date2 = (df.str.extractall('((?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*..(?:\d{,2}..)\d{2,4})'))

col1,col2,col3 = [],[],[]
for i,j in enumerate(date2[0].str.strip().str.split(' ')):
    k=0
    if(len(j)==3):
        col1.append(j[k])
        col2.append(j[k+1])
        col3.append(j[k+2])
    else:
        col1.append(1)
        col2.append(j[k])
        col3.append(j[k+1])
            

date2_temp = pd.DataFrame(data={'col1':col1,'col2':col2,'col3':col3})

(date2_temp['col1'].replace(['Feb','October.', 'Jan','Mar.','Sep.', 'Nov','Dec','Oct','September.'],
                                                ['February','October', 'January','March','September', 'November',
                                                 'December','October','September'],inplace=True))

(date2_temp['col2'].replace(['July,','May,','January,','February,','Mar,','June,','March,', 'Apr,','Decemeber',
                            'Janaury'],
                           ['July','May','January','February','March','June','March', 'April','December',
                           'January'],inplace=True))

rows_to_match = ['April', 'May', 'February',
       'October', 'January', 'July', 'December', 'September', 'March',
       'August', 'November', 'June',]

date2_temp.loc[date2_temp.col1.isin(rows_to_match), 
               ['col1','col2']] = date2_temp.loc[date2_temp.col1.isin(rows_to_match), ['col2','col1']].values

date2_temp.col1 = [str(i).replace(',','') for i in date2_temp.col1]
date2_temp['col1'] = [i.zfill(2) for i in date2_temp['col1'].str.strip()]

date2_temp['final_data'] = date2_temp['col1']+' '+date2_temp['col2']+' '+date2_temp['col3']
date2_temp.set_index(date2.index.levels[0],inplace=True)
date2_temp.rename({'col1':0,'col2':1,'col3':2},axis=1,inplace=True)
date2_temp['final_date'] = pd.to_datetime(date2_temp.final_data,infer_datetime_format=True)
# date2_temp.head()

In [None]:
display_max(date2_temp)

In [21]:
date3 = (df.str.extractall('((?:\d{,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]{,5}\W{,2}\d{2,4})'))

date3_temp = pd.DataFrame(date3[0].str.split(' ').to_list())

date3_temp[0] = [i.replace('.','').replace(',','') for i in date3_temp[0] ]
date3_temp[1] = [i.replace('.','').replace(',','') for i in date3_temp[1] ]

date3_temp.loc[date3_temp[2].isna(),[0,1,2]] = date3_temp.loc[date3_temp[2].isna(),[2, 0,1]].values

date3_temp.loc[date3_temp[0]=='',0] = '01'
date3_temp.loc[date3_temp[0].isna(),0] = '01'

date3_temp[1].unique()

(date3_temp[1].replace(['Feb','Jan', 'Sep','Oct', 'Nov','Mar', 'Aug', 'Dec', 'Jun','Jul', 'Apr','Janaury'],
                    ['February', 'January','September','October', 'November','March','August',
                     'December','June','July','April','January'],inplace=True))
date3_temp[2] = ['19'+str(i) if len(i)<3 else i for i in date3_temp[2]]


date3_temp['final_data'] = date3_temp[0]+' '+date3_temp[1]+' '+date3_temp[2]
date3_temp.set_index(date3.index.levels[0],inplace=True)
date3_temp['final_date'] = pd.to_datetime(date3_temp.final_data,infer_datetime_format=True)

# date3_temp.head()

In [476]:
date4 = (df.str.extractall('((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (?:\d{,2}[sthrdn]*.) \d{4})'))

date4_temp = pd.DataFrame(date4[0].str.split(' ').to_list())

date4_temp[0] = [i.replace('.','').replace(',','') for i in date4_temp[0] ]
date4_temp[1] = [i.replace('.','').replace(',','') for i in date4_temp[1] ]

date4_temp.loc[:,[0,1]] = date4_temp.loc[:,[1,0]].values

date4_temp[1].replace(['Feb', 'Jan', 'Nov', 'Dec', 'Oct'],
                    ['February', 'January', 'November','December','October'],inplace=True)
date4_temp['final_data'] = date4_temp[0]+' '+date4_temp[1]+' '+date4_temp[2]
date4_temp.set_index(date4.index.levels[0],inplace=True)
date4_temp['final_date'] = pd.to_datetime(date4_temp.final_data,infer_datetime_format=True)
# date4_temp

In [615]:
date5 = (df.str.extractall('(\d{,2}/\d{4})'))
temp_index= set(date5.index.levels[0]).difference(date1.index)
date5 = date5[date5.index.levels[0].isin(list(temp_index))]
date5_temp = pd.DataFrame(date5[0].str.split('/').to_list())

date5_temp[2] = '01'

date5_temp[0] = [i.zfill(2) for i in date5_temp[0].str.strip()]

date5_temp['final_data'] = date5_temp[2]+' '+date5_temp[0]+' '+date5_temp[1]
date5_temp.set_index(np.array([*temp_index]),inplace=True)
date5_temp['final_date'] = pd.to_datetime(date5_temp.final_data,format='%d %m %Y')
# display_max(date5_temp)

In [43]:
date6 = (df.str.extractall('([12]\d{3})'))

missing_indices = set(date6.index.levels[0]).difference(set(date1.index).union(date2_temp.index).
                                                        union(date3_temp.index).
                                                        union(date4_temp.index).union(date5_temp.index))

date6_temp = date6.copy()
date6_temp.reset_index(inplace=True)
date6_temp.set_index(date6_temp.level_0,inplace=True)

date6_temp.drop('level_0',axis=1,inplace=True)

date6_temp=date6_temp.iloc[date6_temp.index.get_indexer([*missing_indices])]
date6_temp[1] = '01'
date6_temp[2] = '01'
date6_temp['final_data'] = date6_temp[1]+' '+date6_temp[2]+' '+date6_temp[0]
date6_temp.drop(labels='match',axis=1,inplace=True)
date6_temp['final_date'] = pd.to_datetime(date6_temp.final_data,format='%d %m %Y')
# display_max(date6_temp)

In [170]:
date7 = (df.str.extractall('((\d{1,2}-)(\d{1,2}-)(\d{1,2}))'))

date7_temp = date7.copy()

date7_temp[1] = [i.zfill(2) for i in date7_temp[1].str.replace('-','').str.strip()]
date7_temp[2] = [i for i in date7_temp[2].str.replace('-','')]
date7_temp[3] = ['19'+i for i in date7_temp[3].str.strip()]
date7_temp['final_data'] = date7_temp[2]+' '+date7_temp[1]+' '+date7_temp[3]
date7_temp.set_index(date7.index.levels[0],inplace=True)
date7_temp = date7_temp.drop([0],axis=1).rename({1:0,2:1,3:2},axis=1)
date7_temp['final_date'] = pd.to_datetime(date7_temp.final_data,format='%d %m %Y')
# date7_temp

In [507]:
idx_to_match = date4_temp.index
date3_temp = date3_temp[~date3_temp.isin(idx_to_match)]
temp_index3 = set(date3_temp.index).difference(set(date1.index).union(date2_temp.index))
temp_index4 = set(date4_temp.index).difference(set(date1.index).union(date2_temp.index).union(date3_temp.index))
# temp_index5 = set(date5_temp.index).difference(set(date1.index).union(date2_temp.index).union(date3_temp.index)
#                                               .union(date4_temp.index))

final_date3_df = date3_temp.iloc[np.sort(date3_temp.index.get_indexer([*temp_index3]))]
final_date4_df = date4_temp.iloc[np.sort(date4_temp.index.get_indexer([*temp_index4]))]
# final_date5_df = date5_temp.iloc[np.sort(date5_temp.index.get_indexer([*temp_index5]))]

final_df_list = date1,date2_temp,final_date3_df,final_date4_df,date5_temp,date6_temp,date7_temp
final_df = reduce(lambda x,y: pd.concat([x,y]),final_df_list)
final_df.sort_index(inplace=True)
# display_max(final_df)

In [510]:
np.argsort(final_df['final_date'])

0        9
1       84
2        2
3       53
4       28
      ... 
495    253
496    231
497    141
498    186
499    161
Name: final_date, Length: 500, dtype: int64

To find duplicate indices

In [425]:
date7_temp.index[date7_temp.index.duplicated()]

Int64Index([], dtype='int64')

# Week 2

[Back on top](#TOC)

## Basic NLP Tasks with NLTK

In [5]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [94]:
nltk.download() # downloads all the texts in nltk package

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [41]:
texts() # prints all the texts

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [40]:
sents() # prints one sentence from each text

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


### Counting vocabulary of words

In [38]:
text7

<Text: Wall Street Journal>

In [45]:
sent7 # print one sentence from the Wall Street Journal text

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [46]:
len(sent7)

18

In [47]:
len(text7) # all the words in text7; duplicates included

100676

In [48]:
len(set(text7)) # all the words in text7; duplicates excluded

12408

In [49]:
list(set(text7))[:10]

['sweeping',
 'addiction',
 'excision',
 'Ill.',
 'erasures',
 'smallest',
 'concentration',
 'Piper',
 'Examiner',
 'craft']

### Frequency of words

In [50]:
dist = FreqDist(text7) # this gives the frequency of each word in text7
len(dist)

12408

In [51]:
vocab1 = dist.keys()
#vocab1[:10] 
# In Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [52]:
dist['four'] # how many times does the word 'four' occur 

20

In [53]:
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100] # number of words > 5 characters long appearing > 100 times
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']

### Normalization and stemming

**Normalisation** <br>
Transform a word to make it appear the same way eg. all lower/uppercase, any transformations applied to all the words

In [22]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

**Stemming**<br>
Find the root word

In [23]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

### Lemmatization

**Lemmatization**<br>
*Stemming* doesn't always return a valid word but *lemmatization* does eg.Stemming = Univers, lemmatization=Universe

In [65]:
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [24]:
[porter.stem(t) for t in udhr[:20]] # Still Lemmatization

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [25]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

### Tokenization

In [28]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [29]:
nltk.word_tokenize(text11) # find all the tokens

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

In [30]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12) # finds all the sentences
len(sentences)

4

In [31]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']

## Advanced NLP Tasks with NLTK

### POS tagging

**Part of speech(POS) tagging**<br>
eg. nouns, adjectives, prepositions, conjunctions etc.


In [69]:
nltk.help.upenn_tagset('MD') # there are many other tags for POS and here MD is for modal auxilliary; NN = Noun,
# RB = Adverb, VBG=Verb Gerand form, DT=Determiner(a,the), JJ=Adjective, etc.

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [34]:
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]

In [35]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

In [37]:
# Parsing sentence structure
text15 = nltk.word_tokenize("Alice loves Bob")
# we have Context Free Grammar(CFG) below defined as a string format. So the sentence(S) can have two possibilities
# a Noun Phrase(NP) and a Verb Phrase(VP); VP can have Verb(V) or NP; below we have all the possibilities of NP, VP and V
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
# Once we define the CFG, we can use it to parse the sentence, which will return trees. 
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


In [9]:
import urllib.parse
text16 = nltk.word_tokenize("I saw the man with a telescope")
# grammar1 = nltk.data.load(resource_url=urllib.parse.quote(path)+'mygrammar.cfg')
# one can also use the entire path as indicated below
grammar1 = nltk.data.load(resource_url=r'file://D:/Coursera/Applied Data Science with Python'+\
                          '/Applied Text Mining/Notebooks/mygrammar.cfg')
grammar1
# grammar2learner = nltk.CFG.fromstring("""
# S -> NP VP
# VP -> V NP | VP PP
# PP -> P NP
# NP -> DT N | DT N PP | 'I'
# DT -> 'a' | 'the'
# N -> 'man' | 'telescope'
# V -> 'saw'
# P -> 'with'
# """)
# print(grammar2learner)

<Grammar with 13 productions>

In [31]:
parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16)
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (DT the) (N man)))
    (PP (P with) (NP (DT a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (DT the) (N man) (PP (P with) (NP (DT a) (N telescope))))))


In [81]:
from nltk.corpus import treebank # a collection of parsed trees from WSJ
text17 = treebank.parsed_sents('wsj_0001.mrg')[0] # parsing the first sentence of the WSJ text
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


### POS tagging and parsing ambiguity

In [43]:
text18 = nltk.word_tokenize("The old man the boat") # there's limitation to POS tagging as the man in this case is a verb
# however the POS tagging considers it as a noun
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

In [44]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously") # the sentence doesn't make any sense meaning wise
# in POS tagging Colorless is considered as a Proper Noun(NNP) but it's an adjective(JJ) instead. Even if you remove 
# colorless from the sentence the POS tagging is correct but the sentence still doesn't make sense.
nltk.pos_tag(text19)

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]

## Assignment 2 - Introduction to NLTK

In part 1 of this assignment you will use nltk to explore the Herman Melville novel Moby Dick. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling. 

### Part 1 - Analyzing Moby Dick

In [None]:
import nltk
from nltk.book import *
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

#### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [None]:
def example_one():
    
    return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)

example_one()

#### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [None]:
def example_two():
    
    return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))

example_two()

#### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [None]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

#### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [None]:
def answer_one():
    
    
    return example_two()/example_one()# Your answer here

answer_one()

#### Question 2

What percentage of tokens is 'whale'or 'Whale'?

*This function should return a float.*

In [None]:
def answer_two():
    
    
    return sum([(j) for i,j in list(FreqDist(text1).items()) if i =='whale' or i=='Whale'])/example_one()*100# Your answer here

answer_two()

#### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [None]:
def answer_three():
    
    
    return sorted([(i,j) for i,j in list(FreqDist([k for k in moby_tokens]).items())
                  ],key=lambda x:x[1],reverse=True)[:20]# Your answer here

answer_three()

#### Question 4

What tokens have a length of greater than 5 and frequency of more than 150?

*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [None]:
def answer_four():
    
    freq=FreqDist(moby_tokens)
    return sorted([i for i in freq.keys() if len(i) > 5 and freq[i] > 150])# Your answer here

answer_four()

#### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [None]:
def answer_five():
    
    
    return max(moby_tokens,key=len),len(max(moby_tokens,key=len))# Your answer here

answer_five()

#### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

"Hint:  you may want to use `isalpha()` to check if the token is a word and not punctuation."

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [None]:
def answer_six():
    
    freq=FreqDist(moby_tokens)
    return sorted([(freq[i],i) for i in freq.keys() if i.isalpha()==True and freq[i] > 2000],reverse=True)# Your answer here

answer_six()

#### Question 7

What is the average number of tokens per sentence?

*This function should return a float.*

In [None]:
def answer_seven():
    
    
    return np.mean([len(nltk.word_tokenize(i)) for i in nltk.sent_tokenize(moby_raw)])# Your answer here

answer_seven()

#### Question 8

What are the 5 most frequent parts of speech in this text? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [None]:
def answer_eight():
    
    
    return sorted([(j,k) for j,k in FreqDist([i[1] for i in 
                                              (nltk.pos_tag(moby_tokens))]).items()],key=lambda x: x[1],reverse=True)[:5]# Your answer here

answer_eight()

### Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, that each take a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find find the word in `correct_spellings` that has the shortest distance*, and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.

In [None]:
from nltk.corpus import words

correct_spellings = words.words()

#### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [None]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    recommendations = []
    for i in entries:
        correct_spelling_temp = [ j for j in correct_spellings if j.startswith(i[0]) and len(j)>3]
        distance_calc = [(j,nltk.jaccard_distance(set(nltk.ngrams(i,n=3)),set(nltk.ngrams(j,n=3))))
                         for j in correct_spelling_temp]
        recommendations.append(sorted(distance_calc,key=lambda x:x[1])[0][0])
    
    return recommendations # Your answer here
    
answer_nine()

#### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [None]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    recommendations = []
    for i in entries:
        correct_spelling_temp = [ j for j in correct_spellings if j.startswith(i[0]) and len(j)>3]
        distance_calc = [(j,nltk.jaccard_distance(set(nltk.ngrams(i,n=4)),set(nltk.ngrams(j,n=4))))
                         for j in correct_spelling_temp]
        recommendations.append(sorted(distance_calc,key=lambda x:x[1])[0][0])
    
    return recommendations # Your answer here
    
answer_ten()

#### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [None]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    recommendations = []
    for i in entries:
        correct_spelling_temp = [ j for j in correct_spellings if j.startswith(i[0]) and len(j)>3]
        distance_calc = [(j,nltk.edit_distance(i,j,transpositions=True))for j in correct_spelling_temp]
        recommendations.append(sorted(distance_calc,key=lambda x:x[1])[0][0])
    
    return recommendations # Your answer here 
    
answer_eleven()

# Week 3

[Back on top](#TOC)

## Case Study: Sentiment Analysis

### Data Prep

In [11]:
import pandas as pd,numpy as np,zipfile
# # create a zipfile
# zf = zipfile.ZipFile(path+'amazon.zip', 'w')
# # write the file in the zip and compress it 
# zf.write('Amazon_Unlocked_Mobile.csv', compress_type=zipfile.ZIP_DEFLATED)
# zf.close()
# create a zip object so that we can access the files in it
# if you are working on a linux system then you can generate the compressed file in the below manner. 
# first check the path of the file 
# !ls -l Amazon_Unlocked_Mobile.csv
# archive the file by keying in the path and the file; as this is the path in coursera server we have to put the path 
# of the csv file
# !tar zcvf Amazon_Unlocked_Mobile.csv.tar.gz /home/jovyan/work-ro/Amazon_Unlocked_Mobile.csv
# check the file that you have created in the earlier steo
# !ls -l Amazon_Unlocked_Mobile.csv.tar.gz

zf = zipfile.ZipFile(path+'amazon.zip') 

# Read in the data
df = pd.read_csv(zf.open('Amazon_Unlocked_Mobile.csv'))
# df = pd.read_csv(,sep=',')

# Sample the data to speed up computation
# Comment out this line to match with lecture
# df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [12]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
df = df[df['Rating'] != 3]

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [13]:
# Most ratings are positive
df['Positively Rated'].mean()

0.7482686025879323

In [14]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [15]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 I bought a BB Black and was deliveried a White BB.Really is not a serious provider...Next time is better to cancel the order.


X_train shape:  (231207,)


### CountVectorizer

In [88]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [89]:
vect.get_feature_names()[::2000]

['00',
 '4less',
 'adr6275',
 'assignment',
 'blazingly',
 'cassettes',
 'condishion',
 'debi',
 'dollarsshipping',
 'esteem',
 'flashy',
 'gorila',
 'human',
 'irullu',
 'like',
 'microsaudered',
 'nightmarish',
 'p770',
 'poori',
 'quirky',
 'responseive',
 'send',
 'sos',
 'synch',
 'trace',
 'utiles',
 'withstanding']

In [90]:
len(vect.get_feature_names())

53216

In [91]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [92]:
from sklearn.linear_model import LogisticRegression

# Train the model
# lbfgs has been used instead of liblinear as liblinear won't be the default solver going forward
model = LogisticRegression(solver='lbfgs',max_iter=10e6,n_jobs=-1)
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000000.0,
                   multi_class='warn', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [102]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict_proba(vect.transform(X_test))
# using predicted class labels is the incorrect way of calculating roc_auc_score, instead we have to use 
# Target scores, can either be probability estimates of the positive class, confidence values, 
# or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)
# as this is a binary classification we have only two columns from the predict_proba; class 0 and class 1
print('AUC: ', roc_auc_score(y_test, predictions[:,1]))

AUC:  0.9795208210294712


In [103]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model; argsort is ascending
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['mony' 'worst' 'false' 'worthless' 'horribly' 'messing' 'unsatisfied'
 'blacklist' 'junk' 'superthin']

Largest Coefs: 
['excelent' 'excelente' '4eeeks' 'exelente' 'efficient' 'excellent'
 'loving' 'pleasantly' 'loves' 'mn8k2ll']


### Tfidf

In [104]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data specifiying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

17951

In [106]:
X_train_vectorized = vect.transform(X_train)
# lbfgs has been used instead of liblinear as liblinear won't be the default solver going forward
model = LogisticRegression(solver='lbfgs',max_iter=10e6,n_jobs=-1)
model.fit(X_train_vectorized, y_train)

predictions = model.predict_proba(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions[:,1]))

AUC:  0.9821692343323402


In [107]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Largest tfidf: 
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']


In [108]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']


In [109]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]


### n-grams

In [110]:
# Fit the CountVectorizer to the training data specifiying a minimum 
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

198917

In [None]:
# lbfgs has been used instead of liblinear as liblinear won't be the default solver going forward
model = LogisticRegression(solver='lbfgs',max_iter=10e6,n_jobs=-1)
model.fit(X_train_vectorized, y_train)

In [115]:
predictions = model.predict_proba(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions[:,1]))

AUC:  0.9900388254210035


In [112]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'worst' 'junk' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'looks ok' 'nope']

Largest Coefs: 
['not bad' 'excelent' 'excelente' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'no issues' 'great']


In [113]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


## Assignment 3

In this assignment you will explore text message data and create models to predict if a message is spam or not. 

In [17]:
cd D:\Coursera\Applied Data Science with Python\Applied Text Mining\Notebooks

D:\Coursera\Applied Data Science with Python\Applied Text Mining\Notebooks


In [18]:
import pandas as pd, numpy as np

spam_data = pd.read_csv('spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


In [20]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

### Question 1
What percentage of the documents in `spam_data` are spam?

*This function should return a float, the percent value (i.e. $ratio * 100$).*

In [221]:
def answer_one():
    
    
    return spam_data['target'].mean()*100#Your answer here

In [222]:
answer_one()

13.406317300789663

### Question 2

Fit the training data `X_train` using a Count Vectorizer with default parameters.

What is the longest token in the vocabulary?

*This function should return a string.*

In [126]:
from sklearn.feature_extraction.text import CountVectorizer

def answer_two():
    vect = CountVectorizer().fit(X_train)
    
    
    return max(vect.get_feature_names(),key=len)#Your answer here

In [24]:
answer_two()

'com1win150ppmx3age16subscription'

### Question 3

Fit and transform the training data `X_train` using a Count Vectorizer with default parameters.

Next, fit a fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1`. Find the area under the curve (AUC) score using the transformed test data.

*This function should return the AUC score as a float.*

In [43]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

def answer_three():
    cvclass = CountVectorizer()
    model_vectorised = cvclass.fit_transform(X_train)
    mnb = MultinomialNB(alpha=0.1).fit(model_vectorised,y_train)
    predictions = mnb.predict(cvclass.transform(X_test))
# using predicted class labesls is the incorrect way of calculating roc_auc_score, instead we have to use 
# Target scores, can either be probability estimates of the positive class, confidence values, 
# or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)
    
    return roc_auc_score(y_test,predictions)#Your answer here

In [44]:
answer_three()

0.9720812182741116

### Question 4

Fit and transform the training data `X_train` using a Tfidf Vectorizer with default parameters.

What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?

Put these features in a two series where each series is sorted by tf-idf value and then alphabetically by feature name. The index of the series should be the feature name, and the data should be the tf-idf.

The series of 20 features with smallest tf-idfs should be sorted smallest tfidf first, the list of 20 features with largest tf-idfs should be sorted largest first. 

*This function should return a tuple of two series
`(smallest tf-idfs series, largest tf-idfs series)`.*

In [215]:
from sklearn.feature_extraction.text import TfidfVectorizer

def answer_four():
    tfidf_model = TfidfVectorizer().fit(X_train)
    model_vectorised = tfidf_model.transform(X_train)
    sorted_tfidf_index = model_vectorised.max(axis=0).toarray()[0].argsort()
    sorted_tfidf_data = np.sort(model_vectorised.max(axis=0).toarray()[0])
    feature_names = np.array(tfidf_model.get_feature_names())
    smallest_series = pd.Series(data=sorted_tfidf_data[:20].tolist(),
                                   index=feature_names[sorted_tfidf_index[:20]].tolist()).sort_values(ascending=[True])
    largest_series = pd.Series(data=sorted_tfidf_data[:-21:-1].tolist(),
                                       index=feature_names[sorted_tfidf_index[:-21:-1]].tolist()).sort_values(
        ascending=[True])
    return tuple((smallest_series,largest_series))
#     return (feature_names[sorted_tfidf_index[:-21:-1]].tolist(),feature_names[sorted_tfidf_index[:20]].tolist())
#Your answer here

In [216]:
answer_four()

(sympathetic     0.074475
 venaam          0.074475
 pudunga         0.074475
 organizer       0.074475
 psychologist    0.074475
 stylist         0.074475
 courageous      0.074475
 chef            0.074475
 determined      0.074475
 pest            0.074475
 psychiatrist    0.074475
 exterminator    0.074475
 athletic        0.074475
 listener        0.074475
 companion       0.074475
 dependable      0.074475
 aaniye          0.074475
 healer          0.074475
 diwali          0.091250
 mornings        0.091250
 dtype: float64, blank        0.932702
 tick         0.980166
 645          1.000000
 done         1.000000
 too          1.000000
 anytime      1.000000
 beerage      1.000000
 where        1.000000
 ok           1.000000
 thank        1.000000
 146tf150p    1.000000
 lei          1.000000
 anything     1.000000
 er           1.000000
 thanx        1.000000
 okie         1.000000
 home         1.000000
 havent       1.000000
 nite         1.000000
 yup          1.000000
 dty

### Question 5

Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.

Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and compute the area under the curve (AUC) score using the transformed test data.

*This function should return the AUC score as a float.*

In [119]:
def answer_five():
    tfidf_model = TfidfVectorizer(min_df=3).fit(X_train)
    model_vectorised = tfidf_model.transform(X_train)
    mnb = MultinomialNB(alpha=0.1).fit(model_vectorised,y_train)
    predictions = mnb.predict(tfidf_model.transform(X_test))
# using predicted class labesls is the incorrect way of calculating roc_auc_score, instead we have to use 
# Target scores, can either be probability estimates of the positive class, confidence values, 
# or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)

    return roc_auc_score(y_test,predictions)#Your answer here

In [121]:
answer_five()

0.9416243654822335

### Question 6

What is the average length of documents (number of characters) for not spam and spam documents?

*This function should return a tuple (average length not spam, average length spam).*

In [144]:
def answer_six():
    
    
    return (np.mean(list(map(len,list(spam_data.loc[spam_data['target']!=1,'text'])))),
            np.mean(list(map(len,list(spam_data.loc[spam_data['target']==1,'text'])))))#Your answer here

In [145]:
answer_six()

(71.02362694300518, 138.8661311914324)

<br>
<br>
The following function has been provided to help you combine new features into the training data:

In [148]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

### Question 7

Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5**.

Using this document-term matrix and an additional feature, **the length of document (number of characters)**, fit a Support Vector Classification model with regularization `C=10000`. Then compute the area under the curve (AUC) score using the transformed test data.

*This function should return the AUC score as a float.*

In [225]:
from sklearn.svm import SVC

def answer_seven():
    tfidf_model = TfidfVectorizer(min_df=5).fit(X_train)
    model_vectorised = tfidf_model.transform(X_train)
    model_vectorised_len = add_feature(model_vectorised,X_train.str.len())
    X_test_vectorised = add_feature(tfidf_model.transform(X_test),X_test.str.len())
    svmclf = SVC(C=10000,gamma='auto').fit(model_vectorised_len,y_train)
    predictions = svmclf.predict(X_test_vectorised)
# using predicted class labesls is the incorrect way of calculating roc_auc_score, instead we have to use 
# Target scores, can either be probability estimates of the positive class, confidence values, 
# or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)

    return roc_auc_score(y_test,predictions) #Your answer here

In [226]:
answer_seven()

0.9581366823421557

### Question 8

What is the average number of digits per document for not spam and spam documents?

*This function should return a tuple (average # digits not spam, average # digits spam).*

In [180]:
def answer_eight():
    spam_data['digit'] = spam_data['text'].apply(lambda x: sum([1 for i in list(x) if i.isdigit()]))
    
    return np.mean(spam_data['digit'][spam_data['target']!=1]),np.mean(spam_data['digit'][spam_data['target']==1]) 
#Your answer here

In [181]:
answer_eight()

(0.2992746113989637, 15.759036144578314)

### Question 9

Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **word n-grams from n=1 to n=3** (unigrams, bigrams, and trigrams).

Using this document-term matrix and the following additional features:
* the length of document (number of characters)
* **number of digits per document**

fit a Logistic Regression model with regularization `C=100`. Then compute the area under the curve (AUC) score using the transformed test data.

*This function should return the AUC score as a float.*

In [241]:
from sklearn.linear_model import LogisticRegression

def answer_nine():
    tfidf_model = TfidfVectorizer(min_df=5,ngram_range=[1,3]).fit(X_train)
    model_vectorised = tfidf_model.transform(X_train)
    model_vectorised_len = add_feature(model_vectorised,[X_train.str.len(),X_train.
                                                         apply(lambda x: sum([1 for i in list(x) if i.isdigit()]))])
    X_test_vectorised = add_feature(tfidf_model.transform(X_test),[X_test.str.len(),X_test.
                                                                   apply(lambda x: sum([1 for i in list(x) 
                                                                                        if i.isdigit()]))])
    logreg = LogisticRegression(C=100,solver='liblinear',max_iter=10e6).fit(model_vectorised_len,y_train)
    predictions = logreg.predict(X_test_vectorised)
# using predicted class labesls is the incorrect way of calculating roc_auc_score, instead we have to use 
# Target scores, can either be probability estimates of the positive class, confidence values, 
# or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)

    return roc_auc_score(y_test,predictions)#Your answer here

In [242]:
answer_nine()

0.9653328353394565

### Question 10

What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?

*Hint: Use `\w` and `\W` character classes*

*This function should return a tuple (average # non-word characters not spam, average # non-word characters spam).*

In [201]:
def answer_ten():
    spam_data['digit'] = spam_data['text'].str.findall(r'\W').str.len()
    
    return np.mean(spam_data['digit'][spam_data['target']!=1]),np.mean(spam_data['digit'][spam_data['target']==1])

In [202]:
answer_ten()

(17.29181347150259, 29.041499330655956)

### Question 11

Fit and transform the training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5.**

To tell Count Vectorizer to use character n-grams pass in `analyzer='char_wb'` which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.

Using this document-term matrix and the following additional features:
* the length of document (number of characters)
* number of digits per document
* **number of non-word characters (anything other than a letter, digit or underscore.)**

fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.

Also **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple.

The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.

The three features that were added to the document term matrix should have the following names should they appear in the list of coefficients:
['length_of_doc', 'digit_count', 'non_word_char_count']

*This function should return a tuple `(AUC score as a float, smallest coefs list, largest coefs list)`.*

In [243]:
def answer_eleven():
    cv_model = CountVectorizer(min_df=5,ngram_range=[2,5],analyzer='char_wb').fit(X_train)
    model_vectorised = cv_model.transform(X_train)
    model_vectorised_len = add_feature(model_vectorised,[X_train.str.len(),X_train.
                                                         apply(lambda x: sum([1 for i in list(x) if i.isdigit()])),
                                                        X_train.str.findall(r'\W').str.len()])
    X_test_vectorised = add_feature(cv_model.transform(X_test),[X_test.str.len(),X_test.
                                                                   apply(lambda x: sum([1 for i in list(x) 
                                                                                        if i.isdigit()])),
                                                                  X_test.str.findall(r'\W').str.len()])
    logreg = LogisticRegression(C=100,solver='liblinear',max_iter=10e6).fit(model_vectorised_len,y_train)
    predictions = logreg.predict(X_test_vectorised)
    feature_names = np.array(cv_model.get_feature_names() + ['length_of_doc', 'digit_count', 'non_word_char_count'])
    sorted_coef_index = logreg.coef_[0].argsort()
# using predicted class labesls is the incorrect way of calculating roc_auc_score, instead we have to use 
# Target scores, can either be probability estimates of the positive class, confidence values, 
# or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)

    return (roc_auc_score(y_test,predictions),feature_names[sorted_coef_index[:10]].tolist(),
            feature_names[sorted_coef_index[:-11:-1]].tolist())#Your answer here

In [244]:
answer_eleven()

(0.9780231906694056,
 array([' i', 'ca', '..', '. ', 'pe', ' go', ' m', 'if', 'us', 'go'],
       dtype='<U19'),
 array(['digit_count', 'ia', ' r', 'xt', 'ne', 'co', ' ba', ' x', 'ian ',
        '46'], dtype='<U19'))

# Week 4

[Back on top](#TOC)

In [13]:
# Find appropriate sense of words 
import nltk
from nltk.corpus import wordnet as wn, wordnet_ic
from nltk.collocations import *
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [11]:
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')
# find the path similarity
print('path similarity between deer and elk:',deer.path_similarity(elk))
print('path similarity between deer and horse:',deer.path_similarity(horse))
# lin similarity
brown_ic = wordnet_ic.ic('ic-brown.dat')
print('lin similarity between deer and elk:',deer.lin_similarity(elk,brown_ic))
print('lin similarity between deer and horse:',deer.lin_similarity(horse,brown_ic))

path similarity between deer and elk: 0.5
path similarity between deer and horse: 0.14285714285714285
lin similarity between deer and elk: 0.8623778273893673
lin similarity between deer and horse: 0.7726998936065773


In [21]:
# from collocations we are finding out the word pairs which are the best measures
bigram_measures = nltk.collocations.BigramAssocMeasures()
# reading the words from text1 i.e. from WSJ text in nltk book 
finder = BigramCollocationFinder.from_words(text1)
# applying filter to the list of word pairs indicating that the words have to appear at least a # of times before they can 
# be analysed, in our case the filter is set to 10
finder.apply_freq_filter(10)
# we are finding the n best bigram measures through pointwise mutual information(pmi), in our case 10 best bigram measures
finder.nbest(bigram_measures.pmi,10)

[('.*', '*'),
 ('Cape', 'Horn'),
 ('New', 'Bedford'),
 ('Moby', 'Dick'),
 ('she', 'blows'),
 (",'", 'says'),
 (".'", '"\''),
 ('chief', 'mate'),
 ('years', 'ago'),
 ('lower', 'jaw')]

In [None]:
import re,pandas as pd,numpy as np, nltk
from nltk.corpus import wordnet as wn

# Use path length in wordnet to find word similarity
# find sense of words via synonym set
# n=noun, 01=synonym set for first meaning of the word
deer = wn.synset('deer.n.01')
deer

elk = wn.synset('elk.n.01')
deer.path_similarity(elk)

horse = wn.synset('horse.n.01')
deer.path_similarity(horse)

# Use an information criteria to find word similarity
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
deer.lin_similarity(elk, brown_ic)

deer.lin_similarity(horse, brown_ic)

# Use NLTK Collocation and Association Measures
from nltk.collocations import *
# load some text for examples
from nltk.book import *
# text1 is the book "Moby Dick"
# extract just the words without numbers and sentence marks and make them lower case
text = [w.lower() for w in list(text1) if w.isalpha()]

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
finder.nbest(bigram_measures.pmi,10)

# find all the bigrams with occurrence of at least 10, this modifies our "finder" object
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi,10)

# Working with Latent Dirichlet Allocation (LDA) in Python
# Several packages available, such as gensim and lda. Text needs to be
# preprocessed: tokenizing, normalizing such as lower-casing, stopword
# removal, stemming, and then transforming into a (sparse) matrix for
# word (bigram, etc) occurences.
# generate a set of preprocessed documents
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.book import *

len(stopwords.words('english'))

stopwords.words('english')

# extract just the stemmed words without numbers and sentence marks and make them lower case
p_stemmer = PorterStemmer()
sw = stopwords.words('english')
doc1 = [p_stemmer.stem(w.lower()) for w in list(text1) if w.isalpha() and not w.lower() in sw]
doc2 = [p_stemmer.stem(w.lower()) for w in list(text2) if w.isalpha() and not w.lower() in sw]
doc3 = [p_stemmer.stem(w.lower()) for w in list(text3) if w.isalpha() and not w.lower() in sw]
doc4 = [p_stemmer.stem(w.lower()) for w in list(text4) if w.isalpha() and not w.lower() in sw]
doc5 = [p_stemmer.stem(w.lower()) for w in list(text5) if w.isalpha() and not w.lower() in sw]
doc_set = [doc1, doc2, doc3, doc4, doc5]

# under Windows this generates a warning
import gensim
from gensim import corpora, models

dictionary = corpora.Dictionary(doc_set)
dictionary

# transform each document into a bag of words
corpus = [dictionary.doc2bow((doc)) for doc in doc_set]

# The corpus contains the 5 documents
# each document is a list of indexed features and occurrence count (freq)
print(type(corpus))
print(type(corpus[0]))
print(type(corpus[0][0]))
print(corpus[0][::2000])

# let's try 4 topics for our 5 documents
# 50 passes takes quite a while, let's try less
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=10)

print(ldamodel.print_topics(num_topics=4, num_words=10))

## Assignment 4 - Document Similarity & Topic Modelling

### Part 1 - Document Similarity

For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

The following functions are provided:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.

You will need to finish writing the following functions:
* **`doc_to_synsets:`** returns a list of synsets in document. This function should first tokenize and part of speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each tokens corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.

Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder which will run `test_document_path_similarity` to test that these functions are running correctly. 

*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*

In [None]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    

    # Your Code Here
    arr = nltk.pos_tag(nltk.word_tokenize(doc))
    arr1 = []
    for i in arr:
        if convert_tag(i[1]) == None:
            if len(wn.synsets(i[0]))!=0:
                arr1.append(wn.synsets(i[0])[0])
        else:
            if len(wn.synsets(i[0],convert_tag(i[1]))) != 0:
                arr1.append(wn.synsets(i[0],convert_tag(i[1]))[0])
    return arr1 # Your Answer Here


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    
    
    # Your Code Here
    synset1 = s1
    synset2 = s2
    sum_of_largest =0 
    l = len(synset1)
    for s in synset1:
        arr = [wn.path_similarity(s,t) for t in synset2]
        arr1 = [i for i in arr if i != None]
        if len(arr1)!=0:
            sum_of_largest += max(arr1)
        else:
            l -= 1
    return sum_of_largest/l # Your Answer Here


def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

#### test_document_path_similarity

Use this function to check if doc_to_synsets and similarity_score are correct.

*This function should return the similarity score as a float.*

In [None]:
def test_document_path_similarity():
    doc1 = 'This is a function to test document_path_similarity.'
    doc2 = 'Use this function to see if your code in doc_to_synsets \
    and similarity_score is correct!'
    return document_path_similarity(doc1, doc2)

<br>
___
`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [None]:
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('paraphrases.csv')
paraphrases.head()

#### most_similar_docs

Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.

*This function should return a tuple `(D1, D2, similarity_score)`*

In [None]:
def most_similar_docs():
    
    # Your Code Here
    df = paraphrases[paraphrases['Quality']==1]
    D1=df.iloc[0,1]
    D2=df.iloc[0,2]
    similarity_score = document_path_similarity(D1,D2)
    for i in range(1,len(df)):
        if document_path_similarity(df.iloc[i,1],df.iloc[i,2])>similarity_score:
            D1=df.iloc[i,1]
            D2=df.iloc[i,2]
            similarity_score = document_path_similarity(D1,D2)
    return (D1, D2, similarity_score)# Your Answer Here

#### label_accuracy

Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). Report accuracy of the classifier using scikit-learn's accuracy_score.

*This function should return a float.*

In [None]:
def label_accuracy():
    from sklearn.metrics import accuracy_score

    # Your Code Here
    from sklearn.metrics import accuracy_score

    # Your Code Here
    y_test = paraphrases['Quality']
    my_classifier = []
    for i in range(len(paraphrases)):
        if document_path_similarity(paraphrases.iloc[i,1],paraphrases.iloc[i,2])> 0.75:
            my_classifier.append(1)
        else:
            my_classifier.append(0)
    return accuracy_score(y_test,my_classifier)# Your Answer Here

### Part 2 - Topic Modelling

For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, and with `passes=25` and `random_state=34`.

In [None]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizor to find three letter tokens, remove stop_words, 
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())


In [None]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`

# Your code here:
ldamodel = gensim.models.ldamodel.LdaModel(corpus,passes=25,random_state=34,id2word = id_map,num_topics = 10)

#### lda_topics

Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')`

for example.

*This function should return a list of tuples.*

In [None]:
def lda_topics():
    
    # Your Code Here
    
    return ldamodel.print_topics(10,10)# Your Answer Here

#### topic_distribution

For the new document `new_doc`, find the topic distribution. Remember to use vect.transform on the the new doc, and Sparse2Corpus to convert the sparse matrix to gensim corpus.

*This function should return a list of tuples, where each tuple is `(#topic, probability)`*

In [None]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [None]:
def topic_distribution():
    
    # Your Code Here
    vector_mat = vect.transform(new_doc)
    corpus1 = gensim.matutils.Sparse2Corpus(vector_mat, documents_columns=False)
    return list(ldamodel.get_document_topics(corpus1))[0]# Your Answer Here

#### topic_names

From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

*This function should return a list of 10 strings.*

In [None]:
def topic_names():
    
    # Your Code Here
    
    return ['Education','Politics','Government','Business','Automobiles',
            'Sports','Health','Religion','Computers & IT','Science']# Your Answer Here