### Defining the documents 

In [34]:
install.packages("rvest")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [35]:
install.packages("tm")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [36]:
install.packages("SnowballC")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [37]:
doc1 <- "HS650: The Data Science and Predictive Analytics (DSPA) course (offered
as a massive open online course, MOOC, as well as a traditional University of
Michigan class) aims to build computational abilities, inferential thinking, and
practical skills for tackling core data scientific challenges. It explores
foundational concepts in data management, processing, statistical computing, and
dynamic visualization using modern programming tools and agile web-services.
Concepts, ideas, and protocols are illustrated through examples of real
observational, simulated and research-derived datasets. Some prior quantitative
experience in programming, calculus, statistics, mathematical models, or linear
algebra will be necessary. This open graduate course will provide a general
overview of the principles, concepts, techniques, tools and services for
managing, harmonizing, aggregating, preprocessing, modeling, analyzing and
interpreting large, multi-source, incomplete, incongruent, and heterogeneous data
(big data). The focus will be to expose students to common challenges related to
handling big data and present the enormous opportunities and power associated
with our ability to interrogate such complex datasets, extract useful
information, derive knowledge, and provide actionable forecasting. Biomedical,
healthcare, and social datasets will provide context for addressing specific
driving challenges. Students will learn about modern data analytic techniques and
develop skills for importing and exporting, cleaning and fusing, modeling and
visualizing, analyzing and synthesizing complex datasets. The collaborative
design, implementation, sharing and community validation of high-throughput
analytic workflows will be emphasized thorought the course."

doc2 <- " Bioinformatics 501: The Mathematical Foundations for Bioinformatics
course covers some of the fundamental mathematical techniques commonly used in
bioinformatics and biomedical research. These include: 1) principles of multi-
variable calculus, and complex numbers/functions, 2) foundations of linear
algebra, such as linear spaces, eigen-values and vectors, singular value
decomposition, spectral graph theory and Markov chains, 3) differential equations
and their usage in biomedical system, which includes topic such as existence and
uniqueness of solutions, two dimensional linear systems, bifurcations in one and
two dimensional systems and cellular dynamics, and 4) optimization methods, such
as free and constrained optimization, Lagrange multipliers, data denoising using
optimization and heuristic methods. Demonstrations using MATLAB, R, and Python
are included throughout the course."

doc3 <- "HS 853: This course covers a number of modern analytical methods for
advanced healthcare research. Specific focus will be on reviewing and using
innovative modeling, computational, analytic and visualization techniques to
address concrete driving biomedical and healthcare applications. The course will
cover the 5 dimensions of big data (volume, complexity, multiple scales, multiple
sources, and incompleteness). HS853 is a 4 credit hour course (3 lectures + 1
lab/discussion). Students will learn how to conduct research, employ and report
on recent advanced health sciences analytical methods; read, comprehend and
present recent reports of innovative scientific methods; apply a broad range of
health problems; and experiment with real big data. Topics Covered include:
Foundations of R, Scientific Visualization, Review of Multivariate and Mixed
Linear Models, Causality/Causal Inference and Structural Equation Models,
Generalized Estimating Equations, PCOR/CER methods Heterogeneity of Treatment
Effects, big data, Big-Science, Internal statistical cross-validation, Missing
data, Genotype-Environment-Phenotype, associations, Variable selection
(regularized regression and controlled/knockoff filtering), medical imaging,
Databases/registries, Meta-analyses, classification methods, Longitudinal data
and time-series analysis, Geographic Information Systems (GIS), Psychometrics and
Rasch measurement model analysis, MCMC sampling for Bayesian inference, and
Network Analysis"

doc4 <- "HS 851: This course introduces students to applied inference methods in
studies involving multiple variables. Specific methods that will be discussed
include linear regression, analysis of variance, and different regression models.
This course will emphasize the scientific formulation, analytical modeling,
computational tools and applied statistical inference in diverse health-sciences
problems. Data interrogation, modeling approaches, rigorous interpretation and
inference will be emphasized throughout. HS851 is a 4 credit hour course (3
lectures + 1 lab/discussion). Students will learn how to: Understand the
commonly used statistical methods of published scientific papers , Conduct
statistical calculations/analyses on available data , Use software tools to
analyze specific case-studies data , Communicate advanced statistical
concepts/techniques , Determine, explain and interpret assumptions and
limitations. Topics Covered include Epidemiology , Correlation/SLR , and slope
inference, 1-2 samples , ROC Curve , ANOVA , Non-parametric inference ,
Cronbach's $\alpha$, Measurement Reliability/Validity , Survival Analysis ,
Decision theory , CLT/LLNs - limiting results and misconceptions , Association
Tests , Bayesian Inference , PCA/ICA/Factor Analysis , Point/Interval Estimation
(CI) - MoM, MLE , Instrument performance Evaluation , Study/Research Critiques ,
Common mistakes and misconceptions in using probability and statistics,
identifying potential assumption violations, and avoiding them."

doc5 <- "HS550: This course provides students with an introduction to probability
reasoning and statistical inference. Students will learn theoretical concepts and
apply analytic skills for collecting, managing, modeling, processing,
interpreting and visualizing (mostly univariate) data. Students will learn the
basic probability modeling and statistical analysis methods and acquire knowledge
to read recently published health research publications. HS550 is a 4 credit hour
course (3 lectures + 1 lab/discussion). Students will learn how to: Apply data
management strategies to sample data files , Carry out statistical tests to answer common healthcare research questions using appropriate methods and
software tools , Understand the core analytical data modeling techniques and
their appropriate use Examples of Topics Covered , EDA/Charts , Ubiquitous
variation , Parametric inference , Probability Theory , Odds Ratio/Relative Risk
, Distributions , Exploratory data analysis , Resampling/Simulation , Design of
Experiments , Intro to Epidemiology , Estimation , Hypothesis testing ,
Experiments vs. Observational studies , Data management (tables, streams, cloud,
warehouses, DBs, arrays, binary, ASCII, handling, mechanics) , Power, sample-
size, effect-size, sensitivity, specificity , Bias/Precision , Association vs.
Causality , Rate-of-change , Clinical vs. Stat significance , Statistical
Independence Bayesian Rule."

### Creating a new VCorpus object

In [38]:
docs <- c(doc1, doc2, doc3, doc4, doc5)
class(docs) ## [1] "character"

In [39]:
library(tm) 
doc_corpus <- VCorpus(VectorSource(docs))
doc_corpus ## Content : documents : 5
doc_corpus[[1]]$content 

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5

### Transform text to lowercase

In [40]:
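# Wrap base functions in content_transformer() so each element stays a text document
doc_corpus <- tm_map(doc_corpus, content_transformer(tolower))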
doc_corpus[[1]]

### Text preprocessing 

In [41]:
stopwords("english")

In [42]:
doc_corpus <- tm_map(doc_corpus, stripWhitespace)
doc_corpus[[1]]

In [43]:
doc_corpus <- tm_map(doc_corpus, removePunctuation)
doc_corpus[[2]]
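
The two cleaning steps above can be approximated in base R, which makes their side effects easier to see. This is a sketch rather than tm's actual code path: the regular expressions are assumed to mirror the default behavior of `stripWhitespace` and `removePunctuation`, and the sample string `x` is adapted from `doc2` with extra whitespace added for illustration.

```r
# Sample text adapted from doc2; the extra spaces and line break are illustrative
x <- "complex numbers/functions, 2) foundations   of linear\nalgebra"

# Collapse every run of whitespace (spaces, tabs, newlines) into a single space
no_ws <- gsub("[[:space:]]+", " ", x)

# Delete punctuation outright; "/" is removed, not replaced by a space
no_punct <- gsub("[[:punct:]]+", "", no_ws)

print(no_punct)
# [1] "complex numbersfunctions 2 foundations of linear algebra"
```

Because punctuation is deleted rather than replaced, slash-joined words such as `numbers/functions` become a single long token, which is one plausible source of long entries in the term list.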

In [44]:
doc_corpus <- tm_map(doc_corpus, PlainTextDocument)
doc_corpus[[1]]$content

In [45]:
library(SnowballC)
doc_corpus <- tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content
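
Stemming reduces inflected words to a common root, which is why the term lists that follow contain entries such as `cours` and `statist`. A minimal sketch that calls `SnowballC::wordStem` directly (the function `stemDocument` relies on); the word list is chosen here for illustration and is not part of the original notebook.

```r
library(SnowballC)

# Illustrative words of the kind found in the course descriptions
words <- c("courses", "statistics", "inference", "modeling")

# Apply the Snowball English stemmer to each word
stems <- wordStem(words, language = "english")

# Show each word next to its stem
data.frame(word = words, stem = stems)
```

Stems are not always dictionary words; they only need to be consistent so that variants of the same word are counted together.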

In [46]:
doc_dtm <- TermDocumentMatrix(doc_corpus)
doc_dtm

<<TermDocumentMatrix (terms: 355, documents: 5)>>
Non-/sparse entries: 530/1245
Sparsity           : 70%
Maximal term length: 27
Weighting          : term frequency (tf)
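
The summary figures are related by simple arithmetic: a 355 × 5 matrix has 1775 cells, of which 530 hold a nonzero count and 1245 are zero. A short check using only the numbers printed above (the variable names are illustrative):

```r
# Figures copied from the TermDocumentMatrix summary
n_terms    <- 355
n_docs     <- 5
non_sparse <- 530    # cells with a count greater than zero
sparse     <- 1245   # cells equal to zero

# Every cell is either sparse or non-sparse
total_cells <- n_terms * n_docs
stopifnot(non_sparse + sparse == total_cells)

# Sparsity is the share of zero cells, reported as a rounded percentage
sparsity_pct <- round(100 * sparse / total_cells)
print(sparsity_pct)
# [1] 70
```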

In [47]:
doc_dtm$dimnames$Docs<-as.character(1:5)
inspect(doc_dtm)

<<TermDocumentMatrix (terms: 355, documents: 5)>>
Non-/sparse entries: 530/1245
Sparsity           : 70%
Maximal term length: 27
Weighting          : term frequency (tf)
Sample             :
         Docs
Terms      1  2  3  4 5
  and     19 12 13 10 7
  cours    4  2  3  3 2
  data     7  1  5  3 6
  infer    0  0  2  6 2
  method   0  2  5  3 2
  model    3  0  4  3 3
  statist  2  0  1  5 4
  the      6  3  2  2 2
  use      2  3  1  3 2
  will     6  0  3  4 3


In [48]:
findFreqTerms(doc_dtm, lowfreq=2)
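
`findFreqTerms` keeps the terms whose total count across all documents is at least `lowfreq`. The sketch below reproduces that idea in base R on the ten sample rows shown by `inspect` above. It approximates the behavior rather than calling tm, and it uses a higher threshold (15) than the cell above so that the filter visibly drops some terms.

```r
# Sample rows copied from the inspect() output (terms x documents)
sample_tdm <- rbind(
  and     = c(19, 12, 13, 10, 7),
  cours   = c(4, 2, 3, 3, 2),
  data    = c(7, 1, 5, 3, 6),
  infer   = c(0, 0, 2, 6, 2),
  method  = c(0, 2, 5, 3, 2),
  model   = c(3, 0, 4, 3, 3),
  statist = c(2, 0, 1, 5, 4),
  the     = c(6, 3, 2, 2, 2),
  use     = c(2, 3, 1, 3, 2),
  will    = c(6, 0, 3, 4, 3)
)
colnames(sample_tdm) <- as.character(1:5)

# Total occurrences of each term over all five documents
term_totals <- rowSums(sample_tdm)

# Keep terms that occur at least `lowfreq` times in total
lowfreq <- 15
frequent <- names(term_totals)[term_totals >= lowfreq]
print(frequent)
# [1] "and"  "data" "the"  "will"
```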

In [49]:
findAssocs(doc_dtm, "statist", corlimit=0.8)
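
`findAssocs` reports the terms whose per-document counts correlate with those of the given term at or above `corlimit`. A rough base-R sketch of that computation on a few of the sample rows printed earlier; only a handful of the 355 terms are included and the limit is lowered to 0.5, so the result is illustrative and does not reproduce the cell's output.

```r
# Per-document counts for a few terms, copied from the inspect() sample
counts <- rbind(
  statist = c(2, 0, 1, 5, 4),
  infer   = c(0, 0, 2, 6, 2),
  model   = c(3, 0, 4, 3, 3),
  data    = c(7, 1, 5, 3, 6)
)

target <- "statist"
others <- setdiff(rownames(counts), target)

# Pearson correlation between the target term and each other term across the 5 documents
assoc <- sapply(others, function(term) cor(counts[target, ], counts[term, ]))

# Keep associations at or above the chosen limit, strongest first
corlimit <- 0.5
result <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
print(round(result, 2))
```

With only five documents, each correlation rests on five data points, so high values should be read with caution.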

### Job Ranking 

In [53]:
library(rvest)
job <- read.csv("/home/jose/Downloads/datasets/DSPA/list-of-200-US-jobs.csv")
head(job)

| | Index | Job_Title | Overall_Score | Average_Income.USD. | Work_Environment | Stress_Level | Stress_Category | Physical_Demand | Hiring_Potential | Description |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 1 | Software_Engineer | 60 | 87140 | 150.0 | 10.4 | 1 | 5.0 | 27.4 | Researches_designs_develops_and_maintains_software_systems_along_with_hardware_development_for_medical_scientific_and_industrial_purposes |
| 2 | 2 | Mathematician | 73 | 94178 | 89.72 | 12.78 | 1 | 3.97 | 19.78 | Applies_mathematical_theories_and_formulas_to_teach_or_solve_problems_in_a_business_educational_or_industrial_climate |
| 3 | 3 | Actuary | 123 | 87204 | 179.44 | 16.04 | 1 | 3.97 | 17.04 | Interprets_statistics_to_determine_probabilities_of_accidents_sickness_and_death_and_loss_of_property_from_theft_and_natural_disasters |
| 4 | 4 | Statistician | 129 | 73208 | 89.52 | 14.08 | 1 | 3.95 | 11.08 | Tabulates_analyzes_and_interprets_the_numeric_results_of_experiments_and_surveys |
| 5 | 5 | Computer_Systems_Analyst | 147 | 77153 | 90.78 | 16.53 | 1 | 5.08 | 15.53 | Plans_and_develops_computer_systems_for_businesses_and_scientific_institutions |
| 6 | 6 | Meteorologist | 175 | 85210 | 179.64 | 15.1 | 1 | 6.98 | 12.1 | Studies_the_physical_characteristics_motions_and_processes_of_the_earth's_atmosphere |


### Step 1: Make a VCorpus object

In [54]:
jobs <- as.list(job$Description)
jobCorpus <- VCorpus(VectorSource(jobs))

### Step 2: Clean the VCorpus object

In [55]:
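jobCorpus <- tm_map(jobCorpus, content_transformer(tolower))
# The descriptions use "_" as a word separator, so replace it with a space
jobCorpus <- tm_map(jobCorpus, content_transformer(function(x) gsub("_", " ", x, fixed = TRUE)))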
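
The job descriptions store words joined by underscores, so the replacement string passed to `gsub` determines whether word boundaries survive. A small base-R sketch on one description from the `head(job)` output above (the variable names are illustrative):

```r
desc <- "Plans_and_develops_computer_systems_for_businesses_and_scientific_institutions"

# Replacing "_" with a space keeps each word as a separate token
spaced <- gsub("_", " ", tolower(desc), fixed = TRUE)
tokens_spaced <- strsplit(spaced, " ", fixed = TRUE)[[1]]

# Replacing "_" with an empty string fuses the whole description into one token
fused <- gsub("_", "", tolower(desc), fixed = TRUE)
tokens_fused <- strsplit(fused, " ", fixed = TRUE)[[1]]

length(tokens_spaced)  # 10
length(tokens_fused)   # 1
```

Only the space-separated version gives stop-word removal and stemming individual words to work on.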

In [56]:
jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus <- tm_map(jobCorpus, removePunctuation)
jobCorpus <- tm_map(jobCorpus, stripWhitespace)
jobCorpus <- tm_map(jobCorpus, PlainTextDocument)
jobCorpus <- tm_map(jobCorpus, stemDocument)

### Step 3: Build the document-term matrix