# Web scraping pipeline Part 3

In this section, we can finally apply Natural Language Processing techniques to the dataset. First, we join the scraped data with the corresponding rows of the courses dataset. Then, using regular expressions, tokenization and lemmatization, we extract keywords from the name, description and requirements of the courses.

In [1]:
import re
import json
import numpy as np
import pandas as pd

Natural Language Processing packages:

In [2]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer 
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
import matplotlib.pyplot as plt

Reading the scraped data and the courses dataset:

In [3]:
df_scrapedData = pd.read_csv('../Data/interim/df_scrapedAllData.csv')

In [4]:
df_scrapedData.head()

Unnamed: 0,description,rating,audience,counter,course,requirements,language
0,Accounting is one of the most important skills...,4.6,['Aspiring Accountants and Financial Analysts'...,3663.0,640100.0,No prior knowledge of accounting is assumed or...,English
1,This course is an introduction to the financia...,3.6,"['Students in business and Finance', 'Auditors...",33.0,385604.0,some knowledge of accounting,English
2,*Course Fully Updated for May 2019*The don’t c...,4.7,['Anyone interested in earning an extra income...,300.0,834836.0,You will need some basic knowledge of stock an...,English
3,This Mortgage Acceleration course will teach y...,3.7,['This Mortgage Acceleration course is designe...,7.0,504620.0,Students will need a reliable computer and int...,English
4,"This course is for bookkeepers, accountants an...",3.9,['Individuals / Directors who want to submit t...,10.0,359926.0,It would be helpful if you understood accounti...,English


In [5]:
df_coursesSampling = pd.read_csv('../Data/interim/df_samples.csv')

In [6]:
df_coursesSampling.head()

Unnamed: 0,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,contentInfo,publishedTime,category,timeSpent,publishDate,level,paidBool
0,640100,Accounting & Financial Statement Analysis: Com...,https://www.udemy.com/accounting-fsa-a-solid-f...,True,150.0,10042,594,43,All Levels,3 hours,2015-10-22T00:03:48Z,BussinessFinance,3.0,2015-10-22,All Levels,True
1,385604,Introduction to Financial Consolidation under ...,https://www.udemy.com/introduction-to-financia...,True,25.0,21,3,8,All Levels,1.5 hours,2016-12-05T14:18:39Z,BussinessFinance,1.5,2016-12-05,All Levels,True
2,834836,How to Consistently Win Trading Stocks in 30 D...,https://www.udemy.com/winningstocktrades/,True,145.0,1433,169,15,Intermediate Level,1 hour,2016-05-09T05:44:33Z,BussinessFinance,1.0,2016-05-09,Intermediate Level,True
3,504620,Mortgage Acceleration,https://www.udemy.com/mortgage-acceleration/,True,20.0,247,2,17,All Levels,1.5 hours,2015-08-21T18:36:25Z,BussinessFinance,1.5,2015-08-21,All Levels,True
4,359926,UK Tax Returns with HMRC,https://www.udemy.com/corporation-tax-returns-...,True,40.0,2,0,11,Beginner Level,1 hour,2016-04-05T15:48:32Z,BussinessFinance,1.0,2016-04-05,Beginner Level,True


In [7]:
df_courseSampling = df_coursesSampling.join(df_scrapedData, how='outer')
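
`DataFrame.join` aligns the two frames on their row index, so the step above relies on both CSV files listing the courses in the same order. If that cannot be guaranteed, a key-based merge on `id` and `course` is a possible alternative. The sketch below uses two tiny made-up frames (`courses` and `scraped` are illustrative names, not the real data) and assumes the scraped ids only need a cast from float to integer:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
courses = pd.DataFrame({"id": [640100, 385604, 834836],
                        "title": ["Accounting", "Consolidation", "Trading"]})
scraped = pd.DataFrame({"course": [834836.0, 640100.0],
                        "rating": [4.7, 4.6]})

# The scraped ids were stored as floats; drop missing ids and cast to int to match `id`
scraped = scraped.dropna(subset=["course"]).astype({"course": "int64"})

# Left merge keeps every course; `_merge` tells which rows found scraped data
merged = courses.merge(scraped, how="left", left_on="id", right_on="course",
                       indicator=True)
print(merged[["id", "rating", "_merge"]])
```

Rows flagged `left_only` in `_merge` are courses for which nothing was scraped.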

In [8]:
df_courseSampling.tail()

Unnamed: 0,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,contentInfo,...,publishDate,level,paidBool,description,rating,audience,counter,course,requirements,language
1995,1011550,Build Sign Up and Login Forms With Bootstrap M...,https://www.udemy.com/build-sign-up-and-login-...,True,20.0,1898,19,31,All Levels,4 hours,...,2016-11-17,All Levels,True,Add Sign Up ModalAdding Sign Up Form To ModalS...,4.3,['This course is for anyone who wants to learn...,31.0,1011550.0,Basic HTML/CSSMYSQL insert and select queriesT...,English
1996,143028,Code a Responsive Website Using HTML5 and CSS ...,https://www.udemy.com/how-to-code-a-responsive...,True,50.0,1271,136,110,Beginner Level,7.5 hours,...,2014-01-09,Beginner Level,True,Course Overview This course is the equivalen...,4.4,['This course is designed for beginners who wa...,407.0,143028.0,Students will need to download a free copy of ...,English
1997,1179104,Learning Path: React: Make Stunning React Webs...,https://www.udemy.com/learning-path-react-make...,True,200.0,91,5,53,Expert Level,6.5 hours,...,2017-04-18,Expert Level,True,Packt’s Video Learning Paths are a series of i...,3.7,['This course is ideal for web developers. In ...,25.0,1179104.0,Requires working knowledge of ReactJS and some...,English
1998,361620,Wordpress Tutorial,https://www.udemy.com/responsive-design/,True,200.0,2311,8,44,All Levels,1.5 hours,...,2014-12-04,All Levels,True,Are you looking for a step by step video tuto...,4.2,"['Any one who is eager to succeed', 'whoever w...",11.0,361620.0,just a computer a brain and determinationyou d...,English
1999,294408,Learn to Make an Animated Image Gallery using ...,https://www.udemy.com/learn-to-make-an-animate...,False,0.0,11080,165,7,Beginner Level,1 hour,...,2014-11-20,Beginner Level,False,A short and sweet course for all the HTML5 fa...,4.5,"['Front end web develoeprs', 'Web Designers']",326.0,294408.0,"Basic Knowledge of HTML, CSS and JavaScript",English


In [9]:
df_courseSampling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 23 columns):
id                      2000 non-null int64
title                   2000 non-null object
url                     2000 non-null object
isPaid                  2000 non-null bool
price                   2000 non-null float64
numSubscribers          2000 non-null int64
numReviews              2000 non-null int64
numPublishedLectures    2000 non-null int64
instructionalLevel      2000 non-null object
contentInfo             2000 non-null object
publishedTime           2000 non-null object
category                2000 non-null object
timeSpent               2000 non-null float64
publishDate             2000 non-null object
level                   2000 non-null object
paidBool                2000 non-null bool
description             1567 non-null object
rating                  1589 non-null float64
audience                2000 non-null object
counter                 1589 non-null float

In [10]:
df_courseSampling.description.head()

0    Accounting is one of the most important skills...
1    This course is an introduction to the financia...
2    *Course Fully Updated for May 2019*The don’t c...
3    This Mortgage Acceleration course will teach y...
4    This course is for bookkeepers, accountants an...
Name: description, dtype: object

In [11]:
df_courseSampling.description[0]

"Accounting is one of the most important skills for people pursuing a career in Finance.It helps you understand whether a business is profitable.It gives you an idea of a company’s size.It helps you use the past in order to take action in the present and change the future.However, it’s essential that you understand it well. If you want to become…a Financial Analystan Accountantan Auditora Business Analysta Financial Controllera Financial Managera CFOa CEOan Investment Bankeran Equity Research Analystan Investor an Entrepreneur Someone who is involved with a business and would like to be successfulThen you simply have to learn Accounting and Financial Statement Analysis. There is no way around it.But how can you do that if you have very limited time and no prior training? And how can you be sure that you are not missing an important piece of the puzzle?Accounting &amp; Financial Statement Analysis: Complete Training is here for you. One of the best Finance courses available on Udemy, it

## Natural Language pre-processing 

We build the **expandContractions** function and instantiate the tokenizer, lemmatizer and stemmer; the stop-word set is defined later, right before it is used:

In [12]:
"""
from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
all credits go to alko and arturomp @ stack overflow.
"""

with open('../Data/nlp/wordLists/contractionsList.txt', 'r') as f:
    cList = json.loads(f.read())
    c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)
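
To see the mechanism without the external word list, here is a minimal self-contained sketch. The four-entry `contractions` dictionary is a made-up stand-in for `contractionsList.txt`, and the names `pattern` and `expand` are illustrative. Unlike the cell above, this variant escapes the keys and tries longer contractions first, which is an assumption about what is desirable rather than what the original list requires:

```python
import re

# Hypothetical stand-in for the JSON contraction list
contractions = {
    "it's": "it is",
    "don't": "do not",
    "you'll": "you will",
    "you'll've": "you will have",
}

# Longest keys first, so "you'll've" is not cut short by the "you'll" alternative
keys = sorted(contractions, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(key) for key in keys))

def expand(text):
    # Replace each matched contraction by its expanded form (case-sensitive)
    return pattern.sub(lambda match: contractions[match.group(0)], text)

print(expand("it's essential that you don't skip this"))
# it is essential that you do not skip this
```

Matching is case-sensitive and only recognizes the straight apostrophe, so typographic apostrophes (`’`) have to be replaced beforehand.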

In [13]:
wpt = nltk.WordPunctTokenizer()
lemmatizer = WordNetLemmatizer() 
stemmer = PorterStemmer() 

### **Normalized corpus** of course descriptions

Let's start by normalizing the course description in row 0, whose title is *Accounting & Financial Statement Analysis: Complete Training*:

In [14]:
df_courseSampling.title[0]

'Accounting & Financial Statement Analysis: Complete Training'

**Original description**:

In [15]:
df_courseSampling.description[0]

"Accounting is one of the most important skills for people pursuing a career in Finance.It helps you understand whether a business is profitable.It gives you an idea of a company’s size.It helps you use the past in order to take action in the present and change the future.However, it’s essential that you understand it well. If you want to become…a Financial Analystan Accountantan Auditora Business Analysta Financial Controllera Financial Managera CFOa CEOan Investment Bankeran Equity Research Analystan Investor an Entrepreneur Someone who is involved with a business and would like to be successfulThen you simply have to learn Accounting and Financial Statement Analysis. There is no way around it.But how can you do that if you have very limited time and no prior training? And how can you be sure that you are not missing an important piece of the puzzle?Accounting &amp; Financial Statement Analysis: Complete Training is here for you. One of the best Finance courses available on Udemy, it

Before extracting meaningful words, there is a lot of work to do. Note the missing spaces between mixed words (e.g. *successfulThen*), the use of contractions, and the stop words. Let's define a function that:
1. Expands contractions
2. Inserts an extra space between mixed words
3. Removes special characters (once they have been isolated)
4. Applies tokenization
5. Filters out stop words
6. Applies lemmatization to get the root word

In [16]:
def corpus_normalization(original_text, stop_words):
    """
    Receive the original text and return a normalized corpus (None if the text is missing).

    The caller chooses the stop words. For requirements we pass a stop-word set that excludes
    the word 'no', because the difference between 'No previous knowledge required' and
    'previous knowledge required' is crucial. For course descriptions we are only looking
    for keywords, so the unmodified nltk stop words are sufficient.
    """
    # Missing values arrive as NaN or None; pd.isna covers both
    if pd.isna(original_text):
        return None

    # Mixed words have an uppercase letter right after a lowercase one. Ex: learnedExcellent
    pattern = r'([a-z])([A-Z])'

    # Normalize typographic apostrophes before expanding contractions
    text = re.sub(r'’', "'", original_text)
    text = expandContractions(text)
    word_list = wpt.tokenize(text)
    # Insert a space at every lowercase-to-uppercase boundary, not only the first one
    words_edit = [re.sub(pattern, r'\1 \2', word) for word in word_list]

    text_filtered = ' '.join(words_edit)
    # Filter special characters and lowercase the text
    text_filtered = re.sub(r'[^a-zA-Z\s]', '', text_filtered)
    text_filtered = text_filtered.lower()
    # Tokenize again, lemmatize each token once and drop the stop words
    tokens = wpt.tokenize(text_filtered)
    lemmas = (lemmatizer.lemmatize(word) for word in tokens)
    words_lem = [lemma for lemma in lemmas if lemma not in stop_words]
    return ' '.join(words_lem)
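
The two regular-expression steps can be tried in isolation, without NLTK. The sketch below is a simplified illustration rather than the function above: `split_mixed_words` and `keep_letters` are hypothetical helper names, and because there is no tokenizer to isolate punctuation first, non-letters are replaced by a space instead of being deleted:

```python
import re

def split_mixed_words(text):
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

def keep_letters(text):
    # Replace every non-letter with a space, then lowercase and collapse whitespace
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())

sample = "a career in Finance.It helps you be successfulThen learn"
normalized = keep_letters(split_mixed_words(sample))
print(normalized)
# a career in finance it helps you be successful then learn
```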

In [17]:
stop_words = set(stopwords.words('english')) 

In [18]:
df_courseSampling['normalized_descriptions'] = df_courseSampling.description.apply(corpus_normalization, stop_words=stop_words)

In [19]:
df_courseSampling['normalized_descriptions'].head()

0    accounting one important skill people pursuing...
1    course introduction financial consolidation if...
2    course fully updated may call trading profit n...
3    mortgage acceleration course teach mortgage wo...
4    course bookkeeper accountant uk limited compan...
Name: normalized_descriptions, dtype: object

### Normalized corpus of course requirements

In [20]:
to_remove = ['no']
update_stop_words = set(stopwords.words('english')).difference(to_remove)
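
Why keeping 'no' matters can be shown with a toy example. The stop-word set below is a small hand-picked subset standing in for the NLTK list, and `drop_stop_words` is a hypothetical helper that only performs the filtering step:

```python
# Hand-picked subset standing in for stopwords.words('english')
nltk_like_stop_words = {"no", "of", "is", "or", "some", "the"}
requirements_stop_words = nltk_like_stop_words.difference(["no"])

def drop_stop_words(text, stop_words):
    # Keep only the words that are not in the given stop-word set
    return " ".join(word for word in text.split() if word not in stop_words)

requirement = "no prior knowledge of accounting is required"
with_default = drop_stop_words(requirement, nltk_like_stop_words)
with_updated = drop_stop_words(requirement, requirements_stop_words)

print(with_default)  # prior knowledge accounting required
print(with_updated)  # no prior knowledge accounting required
```

With the default set, the requirement reads as its opposite; the updated set preserves the negation.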

In [21]:
df_courseSampling['normalized_requirements'] = df_courseSampling.requirements.apply(corpus_normalization, stop_words=update_stop_words)

In [22]:
df_courseSampling['normalized_requirements'].head()

0    no prior knowledge accounting assumed needed n...
1                                 knowledge accounting
2    need basic knowledge stock option trading need...
3    student need reliable computer internet connec...
4    would helpful understood accounting terminolog...
Name: normalized_requirements, dtype: object

In [23]:
df_courseSampling.loc[:, ['requirements', 'normalized_requirements']].head(20)

Unnamed: 0,requirements,normalized_requirements
0,No prior knowledge of accounting is assumed or...,no prior knowledge accounting assumed needed n...
1,some knowledge of accounting,knowledge accounting
2,You will need some basic knowledge of stock an...,need basic knowledge stock option trading need...
3,Students will need a reliable computer and int...,student need reliable computer internet connec...
4,It would be helpful if you understood accounti...,would helpful understood accounting terminolog...
5,A computer or Android phone with internet access,computer android phone internet access
6,"Have a copy of Microsoft Excel, Google Sheets ...",copy microsoft excel google sheet another spre...
7,,
8,A Passion to learn and a trading account,passion learn trading account
9,,


In [24]:
def key_word_extractor(norm_corpus):
    """
    Return the bigrams whose rounded TF-IDF weight exceeds 0.20 in any document of
    norm_corpus, together with the corresponding weights, as two parallel lists.
    """
    tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True, ngram_range=(2,2), max_features=40)
    tv_matrix = tv.fit_transform(norm_corpus)
    tv_matrix = tv_matrix.toarray()

    idx = range(len(norm_corpus))
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    vocab = tv.get_feature_names_out()
    df_tfidV = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab, index=idx)

    key_values = []
    key_words = []
    for i in idx:
        for column in vocab:
            weight = df_tfidV.loc[i, column]
            if weight > 0.20:
                key_values.append(weight)
                key_words.append(column)
    return key_words, key_values