Partial changes from daisy's comments

commit 59e7fe5cc1f8501abf9086c4407d480ee964e16d 1 parent bf0e784
@cegme cegme authored
25 paper/vldb12/content/abstract.tex
@@ -4,18 +4,27 @@
In many domains, structured data and unstructured text are
important natural resources to fuel data analysis. Statistical
text analysis needs to be performed over the text data to extract
-structured information for further query processing. Typically, developers
-will connect multiple tools to build
+structured information for further query processing.
+Typically, developers will connect multiple tools to build
off-line batch processes to perform analysis tasks. \system is
an integrated system for ad hoc real-time query processing over
-structured and unstructured data. \system is built along side of MADlib,
+structured and unstructured data.
+\eat{\system is built along side of MADlib,
over PostgreSQL and Greenplum DBMS. \system includes a library of
textual analytic functions and it integrates in-database text
-extraction techniques from BayesStore. To illustrate our system, this
+extraction techniques from BayesStore.}
+\system implements in-database textual analytic functions that we have
+submitted to the madlib open source library for textual analytics.
+We show declarative processing of queries involving analysis
+of structured and textual data.
+
+
+
+\eat{To illustrate our system, this
demonstration uses two application domains---computation journalism
-and political campaign management---to show (1) declarative processing of ad
-hoc queries involving statistical text analysis;
-(2) joining between structured and textual data; and (3) query-time rendering of
-visualizations over the query result.
+and political campaign management---to show (1) declarative processing
+of ad hoc queries involving statistical text analysis;
+(2) joining between structured and textual data; and (3) query-time
+rendering of visualizations over the query result.}
\end{abstract}
25 paper/vldb12/content/introduction.tex
@@ -2,9 +2,10 @@
\section{Introduction}
% increasing amounts of unstructured text data
-The field of database management has traditionally focused on
+\eat{The field of database management has traditionally focused on
structured data, providing little or no help for the significantly
-larger amounts of the world's data that is unstructured. For many
+larger amounts of the world's data that is unstructured.}
+For many
applications, unstructured text and structured data are both
important natural resources to fuel data analysis. For example, a
sports journalist covering NFL (National Football
@@ -14,17 +15,20 @@ \section{Introduction}
unstructured tweets, blogs, and news about the games.
In such applications, analytics are performed over text data
-from many sources. Text analysis uses the state-of-the-art
+from many sources. Text analysis uses
statistical machine learning (SML) methods to extract structured
-information, such as entities, relations, sentiments, topics, from
+information, such as part-of-speech tags, entities, relations,
+sentiments, topics, from
text. The result of the text analysis can be joined with other
structured data sources to perform analysis. For example, the sports
-journalist may want to correlate fan sentiment from tweets with stats
+journalist may want to correlate fan sentiment from tweets
+with stats \ceg{We need to be more specific here}
of the Miami Dolphins\footnote{http://www.miamidolphins.com/}.
-To our knowledge, there is no integrated system with a query
+\eat{To our knowledge, there is no integrated system with a query
interface that enables domain experts (e.g., journalists) to issues
-such ad hoc exploratory queries. To answer such queries, a software
+such ad hoc exploratory queries.}
+To answer such queries, a software
developer is needed to understand and connect multiple tools,
including Lucene for text search, Weka or MATLAB for sentiment
analysis, and a database for joining the structured data with the
@@ -51,7 +55,8 @@ \section{Introduction}
% real-time analysis enable exploratory queries and query refinement based on result
% many applications: computational journalism, campaign management, e-discovery, etc.
-In this demonstration paper, we describe \system, a library and integrated
+\eat{
+We describe \system, a library and integrated
system that supports both relational query processing and
statistical analysis over text. \system is implemented alongside MADLib,
a collection of libraries developed on
@@ -107,10 +112,12 @@ \section{Introduction}
algorithms using native database techniques. Additionally, we
implementation is over PostgreSQL and Greenplum parallel databases,
demonstrating the scalability of our system.}
+}
+
In the demonstration of \system, we will show the
following points using football journalism as our driving example:
-\begin{itemize}
+\begin{itemize}[noitemsep]
\item declarative processing of ad hoc queries
involving statistical text analysis, such as sentiment analysis,
information extraction, and entity resolution;
4 paper/vldb12/content/new_systemdemonstration.tex
@@ -1,6 +1,6 @@
\section{Text Analysis Queries and Demonstration}
Our demonstration will illustrate the following points:
-\begin{enumerate}
+\begin{enumerate}[noitemsep]
\item The ability to perform statistical text analytics inside the DBMS
\item Query driven computation of analytics
\item The flexibility of our data sources by using structured data and text
@@ -197,7 +197,7 @@ \subsection{Text Analytics Queries}
\subsection{User Interface}
-During the conference, we plan to give an interactive demonstration of the
+We plan to give an interactive demonstration of the
{\system}'s capabilities. The demonstration will be based around MADden UI,
a web interface that allows users to perform analytic tasks on our dataset.
MADden UI has two forms of interaction: raw SQL queries, and a Mad
3  paper/vldb12/content/relatedwork.tex
@@ -16,6 +16,7 @@ \section{Related Work}
Browse Query Language (BQL\footnote{\url{http://senseidb.github.com/sensei/bql.html}}) developed for SensiDB and
TweeQL \cite{Marcus:2012:PVD:2094114.2094120} are custom SQL
implementations developed especially for text search.
+\eat{
BQL is an SQL-like interface to a distributed database back end that is highly
specialized for text search. Unlike common NoSQL systems, SenseiDB is able to
perform joins but lacks transactions.
@@ -24,7 +25,7 @@ \section{Related Work}
sophisticated UDFs, complex data types,
a query optimizer other basic relational operations.
Unfortunately, the system does not have the same computational power as RDBMS.
-
+}
SystemT is an information extraction system that provides declarative information
extraction across text documents.
A SIGMOD tutorial gives an in-depth survey of similar systems
2  paper/vldb12/content/systemdescription.tex
@@ -134,7 +134,7 @@ \subsection{Implementation Details}
Core to many natural language processing tasks, POS involves the
labeling of terms within text based on their function in a particular sentence.
-\system uses our own implementation of POS in Postgres and Greenplum. We are
+We implemented POS tagging in Postgres and Greenplum. We are
committing our code to MADLib and it is in under review.
\system uses first order chain conditional random field to model the labeling
3  paper/vldb12/madden.tex
@@ -6,6 +6,7 @@
\usepackage{ulem}
\usepackage{alltt}
\usepackage{hyperref}
+\usepackage{enumitem}
\newcommand{\ceg}[1]{{\textcolor{blue}{#1 -- CEG}}}
\newcommand{\jag}[1]{{\textcolor{red}{#1 -- JAG}}}
@@ -19,7 +20,7 @@
\begin{document}
\title{\system: Query-Driven Statistical Text Analysis}
-
+\subtitle{A Case Study in Computational Journalism}
\author{