We're going to use the Pluralsight courses dataset.

In [2]:
import pandas as pd

content_df = pd.read_csv('pluralsight_courses.csv')

In [5]:
# Take an explicit copy so later column assignments do not act on a slice of content_df
text_data_df = content_df[['CourseId', 'CourseTitle', 'Description']].copy()
text_data_df.head(100)


Unnamed: 0,CourseId,CourseTitle,Description
0,abts-advanced-topics,BizTalk 2006 Business Process Management,This course covers Business Process Management...
1,abts-fundamentals,BizTalk 2006 Fundamentals,Despite the trend towards service-oriented arc...
2,agile-team-practice-fundamentals,Agile Team Practices with Scrum,This course is much different than most of the...
3,appsrv-fundamentals,Windows Server AppFabric Fundamentals,
4,aspdotnet-advanced-topics,ASP.NET 3.5 Advanced Topics,This course covers more advanced topics in ASP...
...,...,...,...
95,intro-sql-server,Introduction to SQL Server,This course starts with a high level introduct...
96,ssrs-adv,Reporting Services Advanced Topics,This course covers advanced topics in Reportin...
97,jscript-fundamentals,JavaScript Fundamentals,This course introduces JavaScript by examining...
98,wp7-core,Core Windows Phone 7 Development,This course will introduce you to core concept...


In [6]:
missing_values = content_df.isnull().sum()
print(missing_values)

CourseId             0
CourseTitle          0
DurationInSeconds    0
ReleaseDate          0
Description          5
AssessmentStatus     0
IsCourseRetired      0
dtype: int64
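
The count above shows that only `Description` has gaps. As a rough sketch of what dropping those rows does, here is the same check on a tiny made-up frame (`toy_df` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical three-row frame with one missing description
toy_df = pd.DataFrame({
    "CourseId": ["a-1", "b-2", "c-3"],
    "Description": ["Covers the basics.", None, "Covers advanced topics."],
})

# Count missing values per column
print(toy_df.isnull().sum())

# Drop only the rows whose Description is missing
cleaned = toy_df.dropna(subset=["Description"])

# The surviving rows keep their original index labels
print(cleaned.index.tolist())
```

Because `dropna` keeps the original index labels, the index in later outputs of this notebook skips from 2 to 4, where the row without a description was removed.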


In [22]:
# Reassign the result instead of using inplace=True on a slice, which raises SettingWithCopyWarning
text_data_df = text_data_df.dropna(subset=['Description']).copy()
missing_values = text_data_df.isnull().sum()
print(missing_values)

CourseId                  0
CourseTitle               0
Description               0
CourseTitle_Lowercased    0
Description_Lowercased    0
dtype: int64


Lowercasing the data

In [14]:
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
text_data_df = text_data_df.assign(
    CourseTitle_Lowercased=text_data_df['CourseTitle'].str.lower(),
    Description_Lowercased=text_data_df['Description'].str.lower(),
)


In [24]:
missing_values = text_data_df.isnull().sum()
print(missing_values)

CourseId                  0
CourseTitle               0
Description               0
CourseTitle_Lowercased    0
Description_Lowercased    0
dtype: int64
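
One reason to remove missing descriptions before the later steps: `.str.lower()` passes missing values through unchanged rather than turning them into strings, so they would reach the tokenizer as non-string values. A small sketch on a made-up Series (`titles` and `lowered` are illustrative names):

```python
import pandas as pd

# Hypothetical titles with one missing entry
titles = pd.Series(["BizTalk 2006 Fundamentals", None, "ASP.NET 3.5 Advanced Topics"])

# Lowercase every string; the missing entry stays missing
lowered = titles.str.lower()

print(lowered.tolist())
print(lowered.isna().sum())
```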


Tokenization

In [27]:
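import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models used by word_tokenize (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt')

# Split each lowercased string into a list of word and punctuation tokens
text_data_df['CourseTitle_tokens'] = text_data_df['CourseTitle_Lowercased'].apply(word_tokenize)
text_data_df['Description_tokens'] = text_data_df['Description_Lowercased'].apply(word_tokenize)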

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Abirr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
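
To give a feel for what tokenization produces without downloading any NLTK data, here is a rough regex-based stand-in. It is not the algorithm `word_tokenize` uses, and `simple_tokenize` is a hypothetical helper; it only approximates the kind of output shown in this notebook, such as keeping `asp.net` and `3.5` as single tokens.

```python
import re

# Words may contain internal dots or hyphens (asp.net, 3.5, service-oriented);
# any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:[.-]\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word-like and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("asp.net 3.5 advanced topics"))
# ['asp.net', '3.5', 'advanced', 'topics']
print(simple_tokenize("the trend towards service-oriented architecture."))
# ['the', 'trend', 'towards', 'service-oriented', 'architecture', '.']
```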


In [None]:
text_data_df.head()

Filtering stop words

In [30]:
from nltk.corpus import stopwords
nltk.download('stopwords')
# A set makes the membership test in the filter below fast
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not English stop words
text_data_df['CourseTitle_noStop'] = text_data_df['CourseTitle_tokens'].apply(lambda x: [word for word in x if word not in stop_words])
text_data_df['Description_noStop'] = text_data_df['Description_tokens'].apply(lambda x: [word for word in x if word not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Abirr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
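
The filtering step is a list comprehension with a set membership test. A self-contained sketch with a tiny hand-picked stop-word set (`toy_stop_words` and `remove_stop_words` are assumptions for illustration; NLTK's English list is much longer):

```python
# Hypothetical miniature stop-word set, not NLTK's list
toy_stop_words = {"this", "is", "than", "most", "of", "the"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = ["this", "course", "is", "much", "different", "than", "most"]
print(remove_stop_words(tokens, toy_stop_words))
# ['course', 'much', 'different']
```

The comparison is case-sensitive, so filtering against a lowercase stop-word list works as intended only on lowercased tokens.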


Lemmatization

In [33]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()

# lemmatize() treats each token as a noun by default (pos='n')
text_data_df['CourseTitle_lemmatized'] = text_data_df['CourseTitle_noStop'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
text_data_df['Description_lemmatized'] = text_data_df['Description_noStop'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Abirr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Abirr\AppData\Roaming\nltk_data...
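
The outputs in this notebook show plural nouns reduced to their singular form (`topics` to `topic`) while words such as `business` are left alone. The toy sketch below, which is not how `WordNetLemmatizer` works internally, contrasts a naive strip-the-final-s rule with a small lookup table to suggest why a dictionary-backed lemmatizer is preferable. `naive_singular`, `lookup_lemma`, and `TOY_LEMMAS` are hypothetical names.

```python
def naive_singular(word):
    """Strip a trailing 's' with no dictionary check (deliberately crude)."""
    return word[:-1] if word.endswith("s") else word

# Hypothetical mini-dictionary mapping inflected forms to lemmas
TOY_LEMMAS = {"topics": "topic", "practices": "practice", "covers": "cover"}

def lookup_lemma(word):
    """Return the lemma if the word is in the dictionary, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

tokens = ["business", "process", "topics", "practices"]

print([naive_singular(t) for t in tokens])
# ['busines', 'proces', 'topic', 'practice']
print([lookup_lemma(t) for t in tokens])
# ['business', 'process', 'topic', 'practice']
```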


In [34]:
text_data_df.head()

Unnamed: 0,CourseId,CourseTitle,Description,CourseTitle_Lowercased,Description_Lowercased,title_tokens,desc_tokens,CourseTitle_tokens,Description_tokens,CourseTitle_noStop,Description_noStop,CourseTitle_lemmatized,Description_lemmatized
0,abts-advanced-topics,BizTalk 2006 Business Process Management,This course covers Business Process Management...,biztalk 2006 business process management,this course covers business process management...,"[biztalk, 2006, business, process, management]","[this, course, covers, business, process, mana...","[biztalk, 2006, business, process, management]","[this, course, covers, business, process, mana...","[biztalk, 2006, business, process, management]","[course, covers, business, process, management...","[biztalk, 2006, business, process, management]","[course, cover, business, process, management,..."
1,abts-fundamentals,BizTalk 2006 Fundamentals,Despite the trend towards service-oriented arc...,biztalk 2006 fundamentals,despite the trend towards service-oriented arc...,"[biztalk, 2006, fundamentals]","[despite, the, trend, towards, service-oriente...","[biztalk, 2006, fundamentals]","[despite, the, trend, towards, service-oriente...","[biztalk, 2006, fundamentals]","[despite, trend, towards, service-oriented, ar...","[biztalk, 2006, fundamental]","[despite, trend, towards, service-oriented, ar..."
2,agile-team-practice-fundamentals,Agile Team Practices with Scrum,This course is much different than most of the...,agile team practices with scrum,this course is much different than most of the...,"[agile, team, practices, with, scrum]","[this, course, is, much, different, than, most...","[agile, team, practices, with, scrum]","[this, course, is, much, different, than, most...","[agile, team, practices, scrum]","[course, much, different, courses, pluralsight...","[agile, team, practice, scrum]","[course, much, different, course, pluralsight,..."
4,aspdotnet-advanced-topics,ASP.NET 3.5 Advanced Topics,This course covers more advanced topics in ASP...,asp.net 3.5 advanced topics,this course covers more advanced topics in asp...,"[asp.net, 3.5, advanced, topics]","[this, course, covers, more, advanced, topics,...","[asp.net, 3.5, advanced, topics]","[this, course, covers, more, advanced, topics,...","[asp.net, 3.5, advanced, topics]","[course, covers, advanced, topics, asp.net, 3....","[asp.net, 3.5, advanced, topic]","[course, cover, advanced, topic, asp.net, 3.5,..."
5,aspdotnet-ajax-advanced-topics,ASP.NET Ajax Advanced Topics,This course covers advanced topics in ASP.NET ...,asp.net ajax advanced topics,this course covers advanced topics in asp.net ...,"[asp.net, ajax, advanced, topics]","[this, course, covers, advanced, topics, in, a...","[asp.net, ajax, advanced, topics]","[this, course, covers, advanced, topics, in, a...","[asp.net, ajax, advanced, topics]","[course, covers, advanced, topics, asp.net, aj...","[asp.net, ajax, advanced, topic]","[course, cover, advanced, topic, asp.net, ajax..."


Preprocessed Data

In [36]:
preprocessed_df = text_data_df[['CourseId','CourseTitle_lemmatized','Description_lemmatized']]
preprocessed_df.head()

Unnamed: 0,CourseId,CourseTitle_lemmatized,Description_lemmatized
0,abts-advanced-topics,"[biztalk, 2006, business, process, management]","[course, cover, business, process, management,..."
1,abts-fundamentals,"[biztalk, 2006, fundamental]","[despite, trend, towards, service-oriented, ar..."
2,agile-team-practice-fundamentals,"[agile, team, practice, scrum]","[course, much, different, course, pluralsight,..."
4,aspdotnet-advanced-topics,"[asp.net, 3.5, advanced, topic]","[course, cover, advanced, topic, asp.net, 3.5,..."
5,aspdotnet-ajax-advanced-topics,"[asp.net, ajax, advanced, topic]","[course, cover, advanced, topic, asp.net, ajax..."
