# Web scraping pipeline Part 3

In this part, we can finally apply Natural Language Processing (NLP) techniques to the dataset. First, we join the scraped data with the corresponding rows of the courses dataset. Then, using regular expressions, tokenization and lemmatization, we extract keywords from the name, description and requirements of each course.

In [1]:
import re
import json
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Natural Language Processing packages:

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer 
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

Language detection:

In [3]:
# !{sys.executable} -m pip install langdetect
from langdetect import detect

Reading scraped Data and course Dataset:

In [4]:
df_scrapedData = pd.read_csv('../Data/interim/df_scrapedAllData.csv')

In [5]:
df_scrapedData.head()

Unnamed: 0,description,rating,audience,counter,course,requirements,language
0,Accounting is one of the most important skills...,4.6,['Aspiring Accountants and Financial Analysts'...,3663.0,640100.0,No prior knowledge of accounting is assumed or...,English
1,This course is an introduction to the financia...,3.6,"['Students in business and Finance', 'Auditors...",33.0,385604.0,some knowledge of accounting,English
2,*Course Fully Updated for May 2019*The don’t c...,4.7,['Anyone interested in earning an extra income...,300.0,834836.0,You will need some basic knowledge of stock an...,English
3,This Mortgage Acceleration course will teach y...,3.7,['This Mortgage Acceleration course is designe...,7.0,504620.0,Students will need a reliable computer and int...,English
4,"This course is for bookkeepers, accountants an...",3.9,['Individuals / Directors who want to submit t...,10.0,359926.0,It would be helpful if you understood accounti...,English


In [6]:
df_coursesSampling = pd.read_csv('../Data/interim/df_samples.csv')

In [7]:
df_coursesSampling.head()

Unnamed: 0,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,contentInfo,publishedTime,category,timeSpent,publishDate,level,paidBool
0,640100,Accounting & Financial Statement Analysis: Com...,https://www.udemy.com/accounting-fsa-a-solid-f...,True,150.0,10042,594,43,All Levels,3 hours,2015-10-22T00:03:48Z,BussinessFinance,3.0,2015-10-22,All Levels,True
1,385604,Introduction to Financial Consolidation under ...,https://www.udemy.com/introduction-to-financia...,True,25.0,21,3,8,All Levels,1.5 hours,2016-12-05T14:18:39Z,BussinessFinance,1.5,2016-12-05,All Levels,True
2,834836,How to Consistently Win Trading Stocks in 30 D...,https://www.udemy.com/winningstocktrades/,True,145.0,1433,169,15,Intermediate Level,1 hour,2016-05-09T05:44:33Z,BussinessFinance,1.0,2016-05-09,Intermediate Level,True
3,504620,Mortgage Acceleration,https://www.udemy.com/mortgage-acceleration/,True,20.0,247,2,17,All Levels,1.5 hours,2015-08-21T18:36:25Z,BussinessFinance,1.5,2015-08-21,All Levels,True
4,359926,UK Tax Returns with HMRC,https://www.udemy.com/corporation-tax-returns-...,True,40.0,2,0,11,Beginner Level,1 hour,2016-04-05T15:48:32Z,BussinessFinance,1.0,2016-04-05,Beginner Level,True


In [8]:
df_courseSampling = df_coursesSampling.join(df_scrapedData, how='outer')

In [9]:
df_courseSampling.tail()

Unnamed: 0,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,contentInfo,...,publishDate,level,paidBool,description,rating,audience,counter,course,requirements,language
1995,1011550,Build Sign Up and Login Forms With Bootstrap M...,https://www.udemy.com/build-sign-up-and-login-...,True,20.0,1898,19,31,All Levels,4 hours,...,2016-11-17,All Levels,True,Add Sign Up ModalAdding Sign Up Form To ModalS...,4.3,['This course is for anyone who wants to learn...,31.0,1011550.0,Basic HTML/CSSMYSQL insert and select queriesT...,English
1996,143028,Code a Responsive Website Using HTML5 and CSS ...,https://www.udemy.com/how-to-code-a-responsive...,True,50.0,1271,136,110,Beginner Level,7.5 hours,...,2014-01-09,Beginner Level,True,Course Overview This course is the equivalen...,4.4,['This course is designed for beginners who wa...,407.0,143028.0,Students will need to download a free copy of ...,English
1997,1179104,Learning Path: React: Make Stunning React Webs...,https://www.udemy.com/learning-path-react-make...,True,200.0,91,5,53,Expert Level,6.5 hours,...,2017-04-18,Expert Level,True,Packt’s Video Learning Paths are a series of i...,3.7,['This course is ideal for web developers. In ...,25.0,1179104.0,Requires working knowledge of ReactJS and some...,English
1998,361620,Wordpress Tutorial,https://www.udemy.com/responsive-design/,True,200.0,2311,8,44,All Levels,1.5 hours,...,2014-12-04,All Levels,True,Are you looking for a step by step video tuto...,4.2,"['Any one who is eager to succeed', 'whoever w...",11.0,361620.0,just a computer a brain and determinationyou d...,English
1999,294408,Learn to Make an Animated Image Gallery using ...,https://www.udemy.com/learn-to-make-an-animate...,False,0.0,11080,165,7,Beginner Level,1 hour,...,2014-11-20,Beginner Level,False,A short and sweet course for all the HTML5 fa...,4.5,"['Front end web develoeprs', 'Web Designers']",326.0,294408.0,"Basic Knowledge of HTML, CSS and JavaScript",English


In [10]:
df_courseSampling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 23 columns):
id                      2000 non-null int64
title                   2000 non-null object
url                     2000 non-null object
isPaid                  2000 non-null bool
price                   2000 non-null float64
numSubscribers          2000 non-null int64
numReviews              2000 non-null int64
numPublishedLectures    2000 non-null int64
instructionalLevel      2000 non-null object
contentInfo             2000 non-null object
publishedTime           2000 non-null object
category                2000 non-null object
timeSpent               2000 non-null float64
publishDate             2000 non-null object
level                   2000 non-null object
paidBool                2000 non-null bool
description             1567 non-null object
rating                  1589 non-null float64
audience                2000 non-null object
counter                 1589 non-null float

In [11]:
df_courseSampling.dropna(inplace=True)

In [12]:
df_courseSampling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1999
Data columns (total 23 columns):
id                      1460 non-null int64
title                   1460 non-null object
url                     1460 non-null object
isPaid                  1460 non-null bool
price                   1460 non-null float64
numSubscribers          1460 non-null int64
numReviews              1460 non-null int64
numPublishedLectures    1460 non-null int64
instructionalLevel      1460 non-null object
contentInfo             1460 non-null object
publishedTime           1460 non-null object
category                1460 non-null object
timeSpent               1460 non-null float64
publishDate             1460 non-null object
level                   1460 non-null object
paidBool                1460 non-null bool
description             1460 non-null object
rating                  1460 non-null float64
audience                1460 non-null object
counter                 1460 non-null float

In [13]:
df_courseSampling.description.head()

0    Accounting is one of the most important skills...
1    This course is an introduction to the financia...
2    *Course Fully Updated for May 2019*The don’t c...
3    This Mortgage Acceleration course will teach y...
4    This course is for bookkeepers, accountants an...
Name: description, dtype: object

In [14]:
df_courseSampling.description[0]

"Accounting is one of the most important skills for people pursuing a career in Finance.It helps you understand whether a business is profitable.It gives you an idea of a company’s size.It helps you use the past in order to take action in the present and change the future.However, it’s essential that you understand it well. If you want to become…a Financial Analystan Accountantan Auditora Business Analysta Financial Controllera Financial Managera CFOa CEOan Investment Bankeran Equity Research Analystan Investor an Entrepreneur Someone who is involved with a business and would like to be successfulThen you simply have to learn Accounting and Financial Statement Analysis. There is no way around it.But how can you do that if you have very limited time and no prior training? And how can you be sure that you are not missing an important piece of the puzzle?Accounting &amp; Financial Statement Analysis: Complete Training is here for you. One of the best Finance courses available on Udemy, it

## Natural Language pre-processing 

### 1. Language detection

In [15]:
def english_detection(string):
    """Return True if langdetect identifies the text as English."""
    return detect(string) == 'en'

### 2. Building **expandContractions** function

In [16]:
"""
from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
all credits go to alko and arturomp @ stack overflow.
"""

with open('../Data/nlp/wordLists/contractionsList.txt', 'r') as f:
    cList = json.loads(f.read())
    c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)
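
To make the mechanism concrete, here is a minimal, self-contained sketch of the same idea. The four-entry `CONTRACTIONS` table and the name `expand_contractions_demo` are hypothetical stand-ins for the full list loaded from `contractionsList.txt`. Unlike the cell above, the sketch escapes the keys and sorts them longest first, a defensive choice that may not be needed for the real list.

```python
import re

# Hypothetical mini table; the notebook loads the full list from contractionsList.txt
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
    "you're": "you are",
}

# Escape regex metacharacters and try longer keys first
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(key) for key in _keys))


def expand_contractions_demo(text, table=CONTRACTIONS):
    """Replace every contraction found in `table` by its expanded form."""
    return _pattern.sub(lambda match: table[match.group(0)], text)


print(expand_contractions_demo("it's essential that you don't skip this"))
# -> it is essential that you do not skip this
```

The match is case-sensitive, so a capitalized `It's` is only expanded if the table contains that spelling as well.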

In [17]:
wpt = nltk.WordPunctTokenizer()
lemmatizer = WordNetLemmatizer() 
stemmer = PorterStemmer() 

### 3. **Normalized corpus** of course descriptions

Let's start by normalizing the description of the course in row 0, titled *Accounting & Financial Statement Analysis: Complete Training*:

In [18]:
df_courseSampling.title[0]

'Accounting & Financial Statement Analysis: Complete Training'

**Original description**:

In [19]:
df_courseSampling.description[0]

"Accounting is one of the most important skills for people pursuing a career in Finance.It helps you understand whether a business is profitable.It gives you an idea of a company’s size.It helps you use the past in order to take action in the present and change the future.However, it’s essential that you understand it well. If you want to become…a Financial Analystan Accountantan Auditora Business Analysta Financial Controllera Financial Managera CFOa CEOan Investment Bankeran Equity Research Analystan Investor an Entrepreneur Someone who is involved with a business and would like to be successfulThen you simply have to learn Accounting and Financial Statement Analysis. There is no way around it.But how can you do that if you have very limited time and no prior training? And how can you be sure that you are not missing an important piece of the puzzle?Accounting &amp; Financial Statement Analysis: Complete Training is here for you. One of the best Finance courses available on Udemy, it

Before we can extract meaningful words, there is a lot of work to do. Note the missing spaces between mixed words (for example, `successfulThen`), and the use of contractions and stop words. Let's define a function that:
1. Expands contractions.
2. Inserts an extra space between mixed words.
3. Removes special characters (once they have been isolated).
4. Applies tokenization.
5. Filters out stop words and words of three characters or fewer.
6. Applies lemmatization to get the root word.
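
The mixed-word step is the least standard one, so here is a small sketch of it on its own. The helper name `split_mixed_words` is hypothetical. The sketch inserts a space at every lowercase-to-uppercase boundary, whereas the function defined below only splits the first boundary of each token.

```python
import re


def split_mixed_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', text)


print(split_mixed_words("would like to be successfulThen you simply have to learn"))
# -> would like to be successful Then you simply have to learn
```

One side effect to keep in mind: legitimate camel-case names are split too, so `JavaScript` becomes `Java Script`.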

In [20]:
def corpus_normalization(original_text, stop_words, wordlist=False):
    """
    Receive the original text and return the normalized corpus, either as a single string
    or, if wordlist=True, as a list of words.

    The stop words are passed in so that they can differ per column. For requirements we remove
    'no' from the stop words, because the difference between 'No previous knowledge required'
    and 'previous knowledge required' is crucial. For course descriptions we are only looking
    for keywords, so editing the nltk stop words in that way is not necessary.
    """
    # Mixed words have an uppercase letter right after a lowercase one. Ex: learnedExcellent
    pattern = '[a-z][A-Z]'
    # Return empty string if input is not a string (np.nan, None) or is a placeholder
    if (original_text is None) or (type(original_text) is not str) or (original_text in ['None', 'NIL']):
        return ''
    # Return empty string if input is a non-English string:
    elif english_detection(original_text) is False:
        return ''
    else:
        text = re.sub(r'’', "'", original_text)
        text = expandContractions(text)
        word_list = wpt.tokenize(text)
        # Insert an extra space at the first lowercase-uppercase boundary of each mixed word
        words_edit = []
        for word in word_list:
            match = re.search(pattern, word)
            if match:
                index = match.start()
                word = word[:index+1] + ' ' + word[index+1:]
            words_edit.append(word)

        text_filtered = ' '.join(words_edit)
        # Filtering special characters
        text_filtered = re.sub(r'[^a-zA-Z\s]', '', text_filtered)
        text_filtered = text_filtered.lower()
        # Tokenization to filter stop words and retrieve roots from derived words
        tokens = wpt.tokenize(text_filtered)
        words_lem = []
        for word in tokens:
            lemma = lemmatizer.lemmatize(word)
            # Note: the length check also discards short words such as 'no',
            # even when they are not in stop_words
            if lemma not in stop_words and len(word) > 3:
                words_lem.append(lemma)
        if wordlist:
            return words_lem
        text_norm = ' '.join(words_lem)

    return text_norm

In [21]:
stop_words = stopwords.words('english')
stop_words.extend(['udemy', 'course', 'school', 'lesson', 'ucstrong', 'rating', 'time', 
                   'html', 'student', 'section', 'professional', 'also', 'using', 'want', 
                   'make', 'take', 'need', 'easy', 'free', 'help', 'basic', 'lecture'])

In [22]:
df_courseSampling['normalized_descriptions'] = df_courseSampling.description.apply(corpus_normalization, stop_words=stop_words)

In [23]:
df_courseSampling['normalized_descriptions'].head()

0    accounting important skill people pursuing car...
1    introduction financial consolidation ifrs aim ...
2    fully updated call trading profit nothing lite...
3    mortgage acceleration teach mortgage work beat...
4    bookkeeper accountant limited company director...
Name: normalized_descriptions, dtype: object

### Normalized corpus of course requirements

In [24]:
to_remove = ['no']
update_stop_words = set(stopwords.words('english')).difference(to_remove)

In [25]:
df_courseSampling['normalized_requirements'] = df_courseSampling.requirements.apply(corpus_normalization, stop_words=update_stop_words)

In [26]:
df_courseSampling['normalized_requirements'].head()

0    prior knowledge accounting assumed needed noth...
1                                 knowledge accounting
2    need basic knowledge stock option trading need...
3    student need reliable computer internet connec...
4    would helpful understood accounting terminolog...
Name: normalized_requirements, dtype: object
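
One detail in this output is worth checking: row 0 starts with "prior knowledge", although the original requirement starts with "No prior knowledge" and 'no' was removed from the stop words. The sketch below uses a simplified stand-in for the filter in `corpus_normalization` (whitespace tokenization, no lemmatization, a hypothetical two-word stop list). It suggests that the `len(word) > 3` condition is what drops 'no'.

```python
def filter_tokens(tokens, stop_words, min_len=4):
    """Keep tokens that are not stop words and have at least `min_len` characters."""
    return [tok for tok in tokens if tok not in stop_words and len(tok) >= min_len]


tokens = "no prior knowledge of accounting is assumed".split()
stop_words = {"of", "is"}  # 'no' is deliberately not a stop word here

print(filter_tokens(tokens, stop_words))
# -> ['prior', 'knowledge', 'accounting', 'assumed']
print(filter_tokens(tokens, stop_words, min_len=1))
# -> ['no', 'prior', 'knowledge', 'accounting', 'assumed']
```

If the negation matters for the requirements, one option would be to exempt 'no' from the length check. That would be a change to the pipeline, not something the notebook currently does.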

In [27]:
df_courseSampling.loc[:, ['description', 'normalized_descriptions', 'requirements', 'normalized_requirements']].head()

Unnamed: 0,description,normalized_descriptions,requirements,normalized_requirements
0,Accounting is one of the most important skills...,accounting important skill people pursuing car...,No prior knowledge of accounting is assumed or...,prior knowledge accounting assumed needed noth...
1,This course is an introduction to the financia...,introduction financial consolidation ifrs aim ...,some knowledge of accounting,knowledge accounting
2,*Course Fully Updated for May 2019*The don’t c...,fully updated call trading profit nothing lite...,You will need some basic knowledge of stock an...,need basic knowledge stock option trading need...
3,This Mortgage Acceleration course will teach y...,mortgage acceleration teach mortgage work beat...,Students will need a reliable computer and int...,student need reliable computer internet connec...
4,"This course is for bookkeepers, accountants an...",bookkeeper accountant limited company director...,It would be helpful if you understood accounti...,would helpful understood accounting terminolog...


In [28]:
df_courseSampling.reset_index(inplace=True)

## Topic Modeling using Latent Dirichlet Allocation (LDA)

The first goal is to recognize the topic of every course. 60% of the dataset is used to train an LDA model on the course descriptions, sampling courses from each of the four categories (**Web Development**, **Graphic Design**, **Music and Instrument** and **Business Finance**) to avoid an unbalanced or biased training dataset.
In this section, we use the gensim package to generate our Bag of Words.

In [29]:
df_english = df_courseSampling[df_courseSampling.normalized_descriptions != ''].reindex()

In [30]:
df_english.groupby('category').count()

Unnamed: 0_level_0,index,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,...,paidBool,description,rating,audience,counter,course,requirements,language,normalized_descriptions,normalized_requirements
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
BussinessFinance,327,327,327,327,327,327,327,327,327,327,...,327,327,327,327,327,327,327,327,327,327
GraphicDesign,306,306,306,306,306,306,306,306,306,306,...,306,306,306,306,306,306,306,306,306,306
MusicInstrument,369,369,369,369,369,369,369,369,369,369,...,369,369,369,369,369,369,369,369,369,369
WebDevelopment,401,401,401,401,401,401,401,401,401,401,...,401,401,401,401,401,401,401,401,401,401


In [31]:
df_train = df_english.groupby('category').apply(lambda x: x.sample(frac=0.6, random_state=1)).reset_index(drop=True)

In [32]:
df_train.groupby('category').count()

Unnamed: 0_level_0,index,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,...,paidBool,description,rating,audience,counter,course,requirements,language,normalized_descriptions,normalized_requirements
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
BussinessFinance,196,196,196,196,196,196,196,196,196,196,...,196,196,196,196,196,196,196,196,196,196
GraphicDesign,184,184,184,184,184,184,184,184,184,184,...,184,184,184,184,184,184,184,184,184,184
MusicInstrument,221,221,221,221,221,221,221,221,221,221,...,221,221,221,221,221,221,221,221,221,221
WebDevelopment,241,241,241,241,241,241,241,241,241,241,...,241,241,241,241,241,241,241,241,241,241


In [33]:
df_train.head()

Unnamed: 0,index,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,...,paidBool,description,rating,audience,counter,course,requirements,language,normalized_descriptions,normalized_requirements
0,89,455452,"Accounting, Finance and Banking - A Comprehens...",https://www.udemy.com/accounting-finance-and-b...,True,180.0,507,36,395,All Levels,...,True,"Welcome to this dream course ""Accounting, Fina...",3.9,"['Accounting Students', 'Banking Students', 'F...",168.0,455452.0,This course will teach you from basic to advan...,English,welcome dream accounting finance banking compr...,course teach basic advanced concept hence take...
1,391,302562,Introduction to Accounting: The Language of Bu...,https://www.udemy.com/learnaccountingforfree/,True,20.0,11958,370,134,Beginner Level,...,True,Learn accounting from the self-made millionair...,4.6,"['Students', ' Entrepreneurs', ' and anyone wh...",1470.0,302562.0,"Elementary math skills (i.e., basic addition s...",English,learn accounting self made millionaire norm ne...,elementary math skill basic addition subtracti...
2,419,1273896,Covered Calls - Powerful Income Strategy for S...,https://www.udemy.com/covered-calls-income-str...,True,60.0,22,0,8,Beginner Level,...,True,If you're a Stock trader or Long term stock in...,4.2,"['All Stock Traders', 'Self-directed investors...",33.0,1273896.0,Trading Stocks and holding a stock portfolioAl...,English,stock trader long term stock investor hold sto...,trading stock holding stock portfolio although...
3,341,1045726,Stock Market investment:Non financial fundamen...,https://www.udemy.com/stock-market-investmentn...,True,125.0,1091,4,29,Beginner Level,...,True,"As an investor, how often do you findyourself ...",3.3,['Investors who wish to do their own stock mar...,28.0,1045726.0,Having some knowledge of accounting would be g...,English,investor often findyourself lost read company ...,knowledge accounting would good mandatory
4,190,42643,Master Iron Condors - Double the credit for ha...,https://www.udemy.com/iron-condor-options-trad...,True,60.0,1338,139,7,Expert Level,...,True,THE IRON CONDOR STRATEGY - KING OF TIME DECAY ...,4.4,"['This is an advanced level course', ' so stud...",290.0,42643.0,Excellent knowledge of Credit spreads and all ...,English,iron condor strategy king decay strategiesthe ...,excellent knowledge credit spread material exc...


In [34]:
df_english.groupby('category').count()

Unnamed: 0_level_0,index,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,...,paidBool,description,rating,audience,counter,course,requirements,language,normalized_descriptions,normalized_requirements
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
BussinessFinance,327,327,327,327,327,327,327,327,327,327,...,327,327,327,327,327,327,327,327,327,327
GraphicDesign,306,306,306,306,306,306,306,306,306,306,...,306,306,306,306,306,306,306,306,306,306
MusicInstrument,369,369,369,369,369,369,369,369,369,369,...,369,369,369,369,369,369,369,369,369,369
WebDevelopment,401,401,401,401,401,401,401,401,401,401,...,401,401,401,401,401,401,401,401,401,401


In [35]:
print('Number of documents: {}'.format(len(df_train)))

Number of documents: 842
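
As a quick sanity check, the 842 training documents can be reproduced from the per-category counts shown earlier. This is only arithmetic on the printed numbers, assuming that a 60% sample of each category is rounded to the nearest integer, which matches the counts in the tables above.

```python
# Per-category counts of English descriptions, copied from the table above
category_counts = {
    "BussinessFinance": 327,
    "GraphicDesign": 306,
    "MusicInstrument": 369,
    "WebDevelopment": 401,
}

# 60% of each category, rounded to the nearest integer
train_counts = {name: round(0.6 * n) for name, n in category_counts.items()}
test_counts = {name: category_counts[name] - train_counts[name] for name in category_counts}

print(train_counts)
print(sum(train_counts.values()))  # -> 842
print(sum(test_counts.values()))   # -> 561
```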


In [36]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [37]:
# !{sys.executable} -m pip install pyLDAvis
# !{sys.executable} -m pip install -U gensim

In [38]:
# !{sys.executable} -m pip install paramiko

In [39]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, TfidfModel, LdaMulticore

In [40]:
# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

In [41]:
df_train['wordList_description'] = df_train.description.apply(corpus_normalization, stop_words=stop_words, wordlist=True)

In [42]:
data_words = list(df_train['wordList_description'])

In [43]:
data_words[1]

['learn',
 'accounting',
 'self',
 'made',
 'millionaire',
 'norm',
 'nemrow',
 'recipient',
 'famed',
 'teaching',
 'award',
 'president',
 'united',
 'state',
 'produced',
 'accounting',
 'university',
 'world',
 'brigham',
 'young',
 'university',
 'rated',
 'london',
 'financial',
 'teach',
 'fundamental',
 'financial',
 'accounting',
 'better',
 'effectively',
 'available',
 'today',
 'guaranteed',
 'talk',
 'good',
 'game',
 'gold',
 'standard',
 'specific',
 'prepares',
 'recruit',
 'four',
 'accounting',
 'firm',
 'worldwide',
 'recommended',
 'harvard',
 'incoming',
 'join',
 'university',
 'world',
 'famous',
 'making',
 'available',
 'accounting',
 'knowledge',
 'highlighted',
 'wired',
 'magazine',
 'gagaom',
 'york',
 'popular',
 'well',
 'acclaimed',
 'welcome',
 'norm',
 'nemrow',
 'accounting',
 'nnac',
 'accounting',
 'introduction',
 'accounting',
 'series',
 'composed',
 'five',
 'challenge',
 'learn',
 'content',
 'five',
 'mastered',
 'first',
 'year',
 'accounting

### Bag of words: 

Next, a dictionary is created from the normalized corpus. It assigns an integer id to every word and keeps track of how often each word appears in the training set:

In [44]:
# Create Dictionary
dictionary = corpora.Dictionary(data_words)

Filtering extreme tokens:
Let's drop tokens that appear in fewer than 40 documents (about 5% of the 842 training documents) or in more than 60% of them.

In [114]:
dictionary.filter_extremes(no_below=40, no_above=0.6)
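
To illustrate what the two thresholds do, here is a pure-Python sketch of document-frequency filtering on a toy corpus. `filter_extremes_sketch` is a hypothetical, simplified stand-in and not gensim's implementation, which may differ in details such as rounding and its additional `keep_n` limit. Here `no_below` is an absolute number of documents and `no_above` a fraction of the corpus.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within both thresholds."""
    n_docs = len(docs)
    # Count each token once per document
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    }


toy_docs = [
    ["music", "chord", "play"],
    ["music", "song"],
    ["music", "design"],
    ["music", "chord"],
]

# 'music' is in 4 of 4 documents (too common), 'chord' in 2, the others in only 1 (too rare)
print(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.6))
# -> {'chord'}
```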

In [115]:
# Create Corpus
texts = data_words

Using gensim's **doc2bow** we convert each document into a list of `(word id, count)` pairs reporting which dictionary words appear in it and how many times. The collection of these lists is called the corpus; its first document is printed below, first with word ids and then with the readable words:

In [116]:
# Term Document Frequency
bow_corpus = [dictionary.doc2bow(text) for text in texts]

print(bow_corpus[:1])

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 5), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 3), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]]


In [117]:
# Human readable format of corpus (term-frequency)
[[(dictionary[id], freq) for id, freq in cp] for cp in bow_corpus[:1]]

[[('access', 1),
  ('advanced', 2),
  ('beginner', 1),
  ('best', 1),
  ('business', 1),
  ('complete', 1),
  ('concept', 2),
  ('content', 1),
  ('financial', 5),
  ('going', 1),
  ('introduction', 1),
  ('knowledge', 1),
  ('learning', 1),
  ('minute', 2),
  ('money', 1),
  ('practical', 1),
  ('process', 3),
  ('question', 1),
  ('right', 1),
  ('skill', 1),
  ('study', 2),
  ('style', 1),
  ('system', 1),
  ('teach', 1),
  ('type', 2),
  ('value', 1),
  ('video', 1),
  ('without', 1),
  ('working', 1),
  ('would', 1)]]
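
For readers new to the bag-of-words representation, the conversion can be sketched in a few lines of plain Python. `build_vocab` and `doc2bow_sketch` are hypothetical helpers. Ids are assigned in order of first appearance, which is a simplification and may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def doc2bow_sketch(doc, vocab):
    """Return sorted (token id, count) pairs, ignoring tokens missing from the vocabulary."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())


toy_docs = [["financial", "accounting", "financial"], ["music", "chord"]]
vocab = build_vocab(toy_docs)

print(doc2bow_sketch(toy_docs[0], vocab))
# -> [(0, 2), (1, 1)]
print(doc2bow_sketch(["music", "guitar"], vocab))
# -> [(2, 1)]
```

The second call shows why filtering the dictionary matters: words that are not in the vocabulary simply disappear from the document.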

### TF-IDF

TF-IDF model from gensim.models, fitted on the bag-of-words corpus:

In [118]:
tfidf = TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
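
The weighting itself can be sketched without gensim. The function below assumes what we understand to be gensim's default scheme: raw term frequency times a base-2 inverse document frequency, with each document vector scaled to unit length. `tfidf_sketch` is a hypothetical name, and details such as how near-zero weights are dropped may differ from the library.

```python
import math


def tfidf_sketch(bow_corpus):
    """Reweight (term id, count) pairs with tf * log2(N / df) and L2-normalize each document."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each term id occurs
    doc_freq = {}
    for doc in bow_corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        vec = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in doc]
        vec = [(t, w) for t, w in vec if w > 0]  # terms present in every document carry no weight
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted_corpus.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted_corpus


toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(3, 1)]]
for doc in tfidf_sketch(toy_bow):
    print(doc)
```

In the first toy document, term 1 is both more frequent and rarer across the corpus than term 0, so it receives the larger weight. This is the same effect that lifts 'financial' above 'video' in the output of the next cell.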

In [119]:
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus_tfidf[:1]]

[[('access', 0.09747429936125805),
  ('advanced', 0.20738850670122774),
  ('beginner', 0.08034732857236077),
  ('best', 0.08173777018745375),
  ('business', 0.10928775386491142),
  ('complete', 0.0892361176570383),
  ('concept', 0.18710681277426944),
  ('content', 0.08796961694773386),
  ('financial', 0.6294841649156842),
  ('going', 0.10928775386491142),
  ('introduction', 0.13091125892944014),
  ('knowledge', 0.08145701427938666),
  ('learning', 0.05877710289016544),
  ('minute', 0.2742439887022831),
  ('money', 0.07097576390623185),
  ('practical', 0.13359020163130383),
  ('process', 0.34196527992258957),
  ('question', 0.10050130426754525),
  ('right', 0.0740995537906088),
  ('skill', 0.06604340331563528),
  ('study', 0.2579598389360193),
  ('style', 0.10974111598304925),
  ('system', 0.12240037950500783),
  ('teach', 0.07144557885709953),
  ('type', 0.22995170524330416),
  ('value', 0.12835011075914451),
  ('video', 0.04966140070488852),
  ('without', 0.10369425335061387),
  ('wor

### Running LDA using BoW and TF-IDF corpora

In [124]:
ldaMulticore_model = LdaMulticore(bow_corpus, num_topics=4, id2word=dictionary, passes=20, workers=2, random_state=5)

In [125]:
ldaMulticore_model.print_topics()

[(0,
  '0.050*"trading" + 0.031*"market" + 0.029*"money" + 0.016*"like" + 0.015*"know" + 0.015*"financial" + 0.014*"step" + 0.013*"year" + 0.013*"price" + 0.013*"back"'),
 (1,
  '0.064*"music" + 0.057*"play" + 0.048*"chord" + 0.031*"playing" + 0.029*"song" + 0.022*"video" + 0.020*"practice" + 0.018*"technique" + 0.017*"note" + 0.016*"work"'),
 (2,
  '0.033*"website" + 0.021*"development" + 0.020*"code" + 0.020*"create" + 0.019*"application" + 0.019*"business" + 0.018*"data" + 0.017*"build" + 0.015*"developer" + 0.014*"skill"'),
 (3,
  '0.066*"design" + 0.037*"create" + 0.034*"script" + 0.033*"project" + 0.025*"tool" + 0.024*"graphic" + 0.023*"image" + 0.017*"work" + 0.016*"cover" + 0.015*"step"')]

Let's calculate the perplexity and the coherence score, two metrics used to evaluate topic models quantitatively.

In [126]:
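# Compute Perplexity
# log_perplexity returns the per-word likelihood bound (a log value, hence negative);
# perplexity is 2**(-bound), so a higher bound means a lower (better) perplexity.
print('\nPerplexity: ', ldaMulticore_model.log_perplexity(bow_corpus))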

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldaMulticore_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -5.014121376535104

Coherence Score:  0.46953959023143516
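
The value printed as "Perplexity" is negative because gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. Below is a minimal sketch of the conversion, assuming the base-2 relation perplexity = 2^(-bound) described in the gensim documentation. `perplexity_from_bound` is a hypothetical helper.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity."""
    return 2 ** (-per_word_bound)


bow_bound = -5.014121376535104  # value printed above for the BoW model
print(perplexity_from_bound(bow_bound))
```

Read this way, a bound closer to zero corresponds to a lower perplexity, which is the better direction.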


In [127]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldaMulticore_model, bow_corpus, dictionary)
vis

Save the model to disk, using `datapath` to build the file path:

In [130]:
from gensim.test.utils import datapath

# datapath resolves the file name inside gensim's test data directory
temp_file = datapath('gensim_ldaMulticoreModel.csv')
ldaMulticore_model.save(temp_file)

In [77]:
# lda_model = gensim.models.ldamodel.LdaModel(corpus=bow_corpus,
#            id2word=dictionary,
#            num_topics=4, 
#            random_state=100,
#            update_every=1,
# #            chunksize=300,
#            passes=2,
#            alpha='auto',
#            per_word_topics=True)

In [133]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=4, id2word=dictionary, passes=20, workers=4, random_state=10)

In [134]:
lda_model_tfidf.print_topics()

[(0,
  '0.049*"design" + 0.030*"graphic" + 0.023*"image" + 0.022*"tool" + 0.021*"create" + 0.015*"project" + 0.014*"technique" + 0.013*"software" + 0.013*"step" + 0.013*"cover"'),
 (1,
  '0.034*"website" + 0.026*"script" + 0.025*"code" + 0.024*"application" + 0.021*"development" + 0.017*"project" + 0.016*"data" + 0.016*"page" + 0.015*"developer" + 0.015*"create"'),
 (2,
  '0.056*"trading" + 0.042*"financial" + 0.035*"market" + 0.023*"business" + 0.017*"money" + 0.014*"company" + 0.014*"value" + 0.013*"future" + 0.013*"price" + 0.010*"long"'),
 (3,
  '0.048*"music" + 0.042*"play" + 0.034*"chord" + 0.029*"playing" + 0.027*"song" + 0.019*"note" + 0.013*"musical" + 0.012*"video" + 0.012*"teacher" + 0.012*"technique"')]

In [135]:
# Compute Perplexity
# log_perplexity returns the per-word likelihood bound (a log value, hence negative);
# perplexity is 2**(-bound), so a higher bound means a lower (better) perplexity.
print('\nPerplexity: ', lda_model_tfidf.log_perplexity(corpus_tfidf))

# Compute Coherence Score
coherence_model_lda_tfidf = CoherenceModel(model=lda_model_tfidf, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda_model_tfidf = coherence_model_lda_tfidf.get_coherence()
print('\nCoherence Score: ', coherence_lda_model_tfidf)


Perplexity:  -5.8458678583775665

Coherence Score:  0.5221226690853776


In [136]:
pyLDAvis.enable_notebook()
vis_tfidf = pyLDAvis.gensim.prepare(lda_model_tfidf, corpus_tfidf, dictionary)
vis_tfidf


In [145]:
temp_file_tfidf = datapath('gensim_ldaMulticoreTFIDFModel.csv')
lda_model_tfidf.save(temp_file_tfidf)

### Performance evaluation: Bag of Words

In [138]:
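# Build the set of training ids once instead of rebuilding a list on every call
train_ids = set(df_train.id)

def test_generator(id_course):
    """Label a course as 'train' if its id was sampled for training, otherwise 'test'."""
    return 'train' if id_course in train_ids else 'test'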

In [139]:
df_english['type_data'] = df_english.id.apply(test_generator)
df_test = df_english[df_english.type_data == 'test'].copy()  # copy so new columns are not set on a slice

In [140]:
df_test.groupby('category').count()

Unnamed: 0_level_0,index,id,title,url,isPaid,price,numSubscribers,numReviews,numPublishedLectures,instructionalLevel,...,description,rating,audience,counter,course,requirements,language,normalized_descriptions,normalized_requirements,type_data
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
BussinessFinance,131,131,131,131,131,131,131,131,131,131,...,131,131,131,131,131,131,131,131,131,131
GraphicDesign,122,122,122,122,122,122,122,122,122,122,...,122,122,122,122,122,122,122,122,122,122
MusicInstrument,148,148,148,148,148,148,148,148,148,148,...,148,148,148,148,148,148,148,148,148,148
WebDevelopment,160,160,160,160,160,160,160,160,160,160,...,160,160,160,160,160,160,160,160,160,160


In [141]:
df_test['wordList_description'] = df_test.description.apply(corpus_normalization, stop_words=stop_words, wordlist=True)


In [142]:
test_words = list(df_test['wordList_description'])

In [143]:
bow_tests = [dictionary.doc2bow(text) for text in test_words]