<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Getting Started with sparkhaven. sparkhaven</title><!-- jquery --><script src="https://code.jquery.com/jquery-3.1.0.min.js" integrity="sha384-nrOSfDHtoPMzJHjVTdCopGqIqeYETSXhZDFyniQ8ZHcVy08QesyHcnOUpMpqnmWq" crossorigin="anonymous"></script><!-- Bootstrap --><script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script><!-- Font Awesome icons --><link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1" crossorigin="anonymous"><!-- pkgdown --><link href="../pkgdown.css" rel="stylesheet"><script src="../pkgdown.js"></script><!-- mathjax --><script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]--></head><body>
<div class="container">
<header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="../index.html">sparkhaven</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav"><li>
<a href="../index.html">Home</a>
</li>
<li>
<a href="../reference/index.html">Reference</a>
</li>
<li>
<a href="../articles/index.html">Articles</a>
</li>
</ul><ul class="nav navbar-nav navbar-right"><li>
<a href="https://github.com/emaasit/sparkhaven">
<span class="fa fa-github fa-lg"></span>
</a>
</li>
</ul></div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
</header><div class="row">
<div class="col-md-9">
<div class="page-header toc-ignore">
<h1>Getting Started with sparkhaven</h1>
<h4 class="author">Daniel Emaasit</h4>
<h4 class="date">2016-10-05</h4>
</div>
<div id="sparkhaven-read-sas-spss-stata-data-files-into-spark-dataframes" class="section level1">
<h1>sparkhaven: Read SAS, SPSS, & STATA data files into Spark DataFrames</h1>
<div id="what-is-sparkhaven" class="section level2">
<h2>What is sparkhaven?</h2>
<p>sparkhaven is a sparklyr extension that reads SAS, SPSS, & STATA data files into Spark DataFrames. It relies on existing Spark packages to load these datasets in parallel.</p>
<p><strong>Currently, only SAS files are supported. Support for SPSS & STATA will come shortly. Submit a pull request if you are interested in contributing.</strong></p>
</div>
<div id="installation" class="section level2">
<h2>Installation</h2>
<p>sparkhaven requires the sparklyr package to run.</p>
<div id="install-sparklyr" class="section level3">
<h3>Install sparklyr</h3>
<p>I recommend the latest stable version of sparklyr, which is available on CRAN:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">install.packages</span>(<span class="st">"sparklyr"</span>)</code></pre></div>
</div>
<div id="install-sparkhaven" class="section level3">
<h3>Install sparkhaven</h3>
<p>Install the development version of sparkhaven from its GitHub repository using devtools:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(devtools)
devtools::<span class="kw">install_github</span>(<span class="st">"emaasit/sparkhaven"</span>)</code></pre></div>
</div>
</div>
<div id="connecting-to-spark" class="section level2">
<h2>Connecting to Spark</h2>
<p>If Spark is not already installed, use the following sparklyr command to install your preferred version of Spark:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(sparklyr)
<span class="kw">spark_install</span>(<span class="dt">version =</span> <span class="st">"2.0.0"</span>)</code></pre></div>
<p>The call to <code>library(sparkhaven)</code> makes the sparkhaven functions available on the R search path and also ensures that the dependencies required by the package are included when we connect to Spark.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(sparkhaven) </code></pre></div>
<p>We can create a Spark connection as follows:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sc <-<span class="st"> </span><span class="kw">spark_connect</span>(<span class="dt">master =</span> <span class="st">"local"</span>)</code></pre></div>
</div>
<div id="reading-sas-files" class="section level2">
<h2>Reading SAS files</h2>
<p>sparkhaven provides the function <code>spark_read_sas</code> to read SAS data files in .sas7bdat format into Spark DataFrames. It uses a Spark package called spark-sas7bdat.</p>
<p>In the example below, we read a SAS data file called mtcars.sas7bdat, which ships with sparkhaven, into a Spark table called sas_example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">mtcars_file <-<span class="st"> </span><span class="kw">system.file</span>(<span class="st">"extdata"</span>, <span class="st">"mtcars.sas7bdat"</span>, <span class="dt">package =</span> <span class="st">"sparkhaven"</span>)
mtcars_df <-<span class="st"> </span><span class="kw">spark_read_sas</span>(sc, <span class="dt">path =</span> mtcars_file, <span class="dt">table =</span> <span class="st">"sas_example"</span>)
mtcars_df</code></pre></div>
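<p>Because <code>system.file</code> returns an empty string when it cannot find the requested file, it can help to validate the path before handing it to <code>spark_read_sas</code>. The following is a minimal sketch using base R only; <code>check_sas_path</code> is a hypothetical helper written for illustration and is not part of sparkhaven.</p>

```r
# Hypothetical helper (not part of sparkhaven): fail early with a clear
# message if the path is empty, missing, or lacks the .sas7bdat extension.
check_sas_path <- function(path) {
  if (!nzchar(path)) {
    stop("Empty path: system.file() returns \"\" when the file is not found.")
  }
  if (!file.exists(path)) {
    stop("File does not exist: ", path)
  }
  if (!grepl("\\.sas7bdat$", path, ignore.case = TRUE)) {
    stop("Expected a .sas7bdat file: ", path)
  }
  invisible(path)
}

# Demonstration with a temporary empty file standing in for a real SAS data set
tmp <- tempfile(fileext = ".sas7bdat")
file.create(tmp)
check_sas_path(tmp)
```

<p>In the example above, one could call <code>check_sas_path(mtcars_file)</code> just before <code>spark_read_sas</code>. The helper only inspects the path; it does not check that the file contents are valid SAS data.</p>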
<p>The result is a pointer to a Spark table, which can then be used in dplyr statements.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(dplyr)
mtcars_df %>%<span class="st"> </span><span class="kw">group_by</span>(cyl) %>%
<span class="st"> </span><span class="kw">summarise</span>(<span class="dt">count =</span> <span class="kw">n</span>(), <span class="dt">avg.mpg =</span> <span class="kw">mean</span>(mpg), <span class="dt">avg.displacement =</span> <span class="kw">mean</span>(disp), <span class="dt">avg.horsepower =</span> <span class="kw">mean</span>(hp))</code></pre></div>
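<p>To get a feel for the shape of this summary without a Spark connection, the same grouping can be run on R's built-in <code>mtcars</code> data frame. This is only a local sketch: it assumes the packaged SAS file holds the same columns as the built-in data set, and the column names <code>avg_mpg</code>, <code>avg_disp</code>, and <code>avg_hp</code> are chosen here for illustration.</p>

```r
library(dplyr)

# The built-in mtcars data frame stands in for the Spark table
# (an assumption made for illustration; no Spark connection is used).
local_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    count    = n(),        # number of cars per cylinder count
    avg_mpg  = mean(mpg),  # mean miles per gallon
    avg_disp = mean(disp), # mean displacement
    avg_hp   = mean(hp)    # mean horsepower
  )

local_summary
```

<p>On the Spark side, the pipeline stays in Spark until the result is requested; appending <code>collect()</code> to it brings the summarised rows into R as a local data frame.</p>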
</div>
<div id="reading-spss-files" class="section level2">
<h2>Reading SPSS files</h2>
<p><strong>Coming soon!</strong></p>
</div>
<div id="reading-stata-files" class="section level2">
<h2>Reading STATA files</h2>
<p><strong>Coming soon!</strong></p>
</div>
<div id="logs-disconnect" class="section level2">
<h2>Logs & Disconnect</h2>
<p>Look at the Spark log from R:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">spark_log</span>(sc, <span class="dt">n =</span> <span class="dv">100</span>)</code></pre></div>
<p>Now we disconnect from Spark:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">spark_disconnect</span>(sc)</code></pre></div>
</div>
<div id="acknowledgements" class="section level2">
<h2>Acknowledgements</h2>
<p>Thanks to RStudio for the sparklyr package, which provides the functionality for creating extensions.</p>
</div>
</div>
</div>
<div class="col-md-3 hidden-xs">
<div id="tocnav">
<h2>Contents</h2>
<ul class="nav nav-pills nav-stacked"><li><a href="#sparkhaven-read-sas-spss-stata-data-files-into-spark-dataframes">sparkhaven: Read SAS, SPSS, & STATA data files into Spark DataFrames</a><ul class="nav nav-pills nav-stacked"><li><a href="#what-is-sparkhaven">What is sparkhaven?</a></li>
<li><a href="#installation">Installation</a></li>
<li><a href="#connecting-to-spark">Connecting to Spark</a></li>
<li><a href="#reading-sas-files">Reading SAS files</a></li>
<li><a href="#reading-spss-files">Reading SPSS files</a></li>
<li><a href="#reading-stata-files">Reading STATA files</a></li>
<li><a href="#logs-disconnect">Logs & Disconnect</a></li>
<li><a href="#acknowledgements">Acknowledgements</a></li>
</ul></li>
</ul></div>
</div>
</div>
<footer><p>Built by <a href="http://hadley.github.io/pkgdown/">pkgdown</a>. Styled with <a href="http://getbootstrap.com">Bootstrap 3</a>.</p>
</footer></div>
</body></html>