# Popular Data Science Questions

In this project, we will be exploring the world of data science and its applications in the business context. The objective is to determine the most appropriate content to create for a company that produces data science materials, including books, online articles, videos, and interactive text-based platforms.

The focus will be on determining what people want to learn about in data science, rather than solely on identifying the most profitable content. To gain insights into the content that resonates with individuals interested in data science, the project will involve investigating popular question and answer websites, such as [Stack Overflow](https://stackoverflow.com/), which is part of the [Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange) network.

<img src="https://dq-content.s3.amazonaws.com/469/se-logo.png" width="400" height="400">

## Project overview

Stack Exchange is a platform that offers various websites covering a diverse range of subjects, including data science, mathematics, physics, and philosophy. Here's a sample of the most popular sites:

[<img src="https://dq-content.s3.amazonaws.com/469/se_sites.png" width="600" height="400">](https://stackexchange.com/sites?view=list#percentanswered)


To maintain quality, Stack Exchange has implemented a reputation award system for its questions and answers. Each post is subject to upvotes and downvotes, making it easy to identify high-quality content.

If we are new to Stack Overflow or any other Stack Exchange website, we can take a [tour](https://stackexchange.com/tour) to get a better understanding of the platform.

Several Stack Exchange websites are relevant to our project, including:

- [Data Science](https://datascience.stackexchange.com/)
- [Cross Validated](https://stats.stackexchange.com/), a statistics site
- [Artificial Intelligence](https://ai.stackexchange.com/)
- [Mathematics](https://math.stackexchange.com/)
- [Stack Overflow](https://stackoverflow.com/)

For those interested in including Data Engineering, additional sites to consider are:

- [Database Administrators](https://dba.stackexchange.com/)
- [Unix & Linux](https://unix.stackexchange.com/)
- [Software Engineering](https://softwareengineering.stackexchange.com/)

By following the link attached to the image above, we can view a comprehensive list of Stack Exchange websites sorted by the percentage of answered questions. At the time of this writing, Data Science Stack Exchange (DSSE) is among the bottom 10 sites based on this metric.

Despite its lower percentage of answered questions, DSSE's dedicated focus on data science makes it an ideal candidate for this investigation. DSSE will therefore be the primary focus of our project, which marks our first off-Dataquest task.

## Data Science Stack Exchange

If we haven't checked out a Stack Exchange website yet, let's take some time to give one a spin. We can try answering a few questions in a markdown cell, to get the hang of it.

### What kind of questions are welcome on this site?

On the Questions section of DSSE's [help center](https://datascience.stackexchange.com/help/asking), it is advised to:

- Avoid subjective questions
- Ask practical, data science-related questions instead of theoretical ones
- Ask specific questions that are relevant to others

By following these guidelines, we can make our questions more helpful to others.

The help center also mentions two other relevant sites:

- [Open Data](https://opendata.stackexchange.com/help/on-topic) (for requests of datasets)
- [Computational Science](https://scicomp.stackexchange.com/help/on-topic) (for information on software packages and algorithms in applied mathematics)

### What, other than questions, does DSSE's home page subdivide into?

On the [home page](https://datascience.stackexchange.com/), there are five main sections:

1. [Questions](https://datascience.stackexchange.com/questions): A list of all questions asked
2. [Tags](https://datascience.stackexchange.com/tags): Keywords or labels that categorize questions
3. [Users](https://datascience.stackexchange.com/users): A list of all users
4. [Companies](https://stackoverflow.com/jobs/companies): A directory of tech companies and their open job listings on Stack Overflow Jobs
5. [Unanswered](https://datascience.stackexchange.com/unanswered): A list of unanswered questions

The Stack Exchange tagging system is useful for our problem as it allows us to see how many questions are asked about each subject.

Another interesting fact about Stack Exchange sites is that they are heavily moderated by the community, giving us confidence in using the tagging system to draw conclusions.
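The idea of counting questions per tag can be sketched in a few lines of Python. This is only an illustration: the `questions_tags` list below is a hypothetical, hand-written sample, not data pulled from DSSE.

```python
from collections import Counter

# Hypothetical sample (not real DSSE data): each inner list holds
# the tags attached to one question.
questions_tags = [
    ["machine-learning", "python"],
    ["python", "pandas"],
    ["machine-learning", "deep-learning", "python"],
]

# Count how many questions carry each tag.
tag_counts = Counter(tag for tags in questions_tags for tag in tags)

# Print tags from most to least used.
for tag, count in tag_counts.most_common():
    print(tag, count)
```

With real data, the same counting step would let us rank subjects by how often they are asked about.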

### What information is available in each post?

Each post on Data Science Stack Exchange typically contains the following information:

- Problem or question: A clear and concise description of the data science-related problem or question being addressed.
- Code snippets: Relevant code snippets that help to demonstrate a solution to the problem or question.
- Data visualizations: Any relevant data visualizations or graphs that help to illustrate the issue or solution.
- Discussions: Discussions of best practices or techniques related to the topic being addressed.
- Author: The name or username of the person who posted the question or solution.
- Date and time: The date and time the post was created.
- Upvotes and downvotes: A tally of the number of upvotes and downvotes the post has received from other users, which can be used to gauge the popularity and relevance of the post.
- Comments and replies: Any comments or replies from other users, which can provide additional information, clarification, or alternative solutions to the problem or question being addressed.

For instance, we can take a look at the following [post](https://datascience.stackexchange.com/questions/118124/how-to-extract-embeddings-from-an-audio-file-using-wav2vec-along-with-context).

## Stack Exchange Data Explorer

Having explored the website, we can see that the tags will help us organize the content by subject and spare us from categorizing questions manually.

The next challenge is to access the data in bulk. Instead of web scraping, we'll use a more convenient alternative. Stack Exchange offers a public database for each of its sites, including Data Science Stack Exchange. You can find a link to query and explore the database [here](https://data.stackexchange.com/datascience/query/new).

We can learn more about Stack Exchange Data Explorer (SEDE) through its [help section](https://data.stackexchange.com/help) and a tutorial found at the following [link](https://data.stackexchange.com/tutorial).

In the image displayed below, we see the names of each table within the database, and by clicking on the names, we can expand and view the columns within each table.

<img src="https://dq-content.s3.amazonaws.com/469/dsde.png" width="300" height="300">

For instance, the query below returns the 10 most-used tags from the `Tags` table.

```sql
SELECT TOP 10 *
  FROM tags
 ORDER BY Count DESC;
```

| Id | TagName | Count | ExcerptPostId | WikiPostId | IsModeratorOnly | IsRequired |
|----|---------|-------|---------------|------------|-----------------|------------|
| 2 | machine-learning | 11011 | 4909 | 4908 | | |
| 46 | python | 6521 | 5523 | 5522 | | |
| 194 | deep-learning | 4776 | 8956 | 8955 | | |
| 81 | neural-network | 4260 | 8885 | 8884 | | |
| 77 | classification | 3169 | 4911 | 4910 | | |
| 324 | keras | 2724 | 9251 | 9250 | | |
| 47 | nlp | 2536 | 147 | 146 | | |
| 128 | scikit-learn | 2249 | 5896 | 5895 | | |
| 321 | tensorflow | 2183 | 9183 | 9182 | | |
| 72 | time-series | 1801 | 8904 | 8903 | | |


It's worth mentioning that SEDE uses Transact-SQL (Microsoft's SQL dialect) rather than the SQLite dialect we worked with previously. Most of the syntax is the same, but there are a few differences to be aware of.

In SQLite, instead of using the keyword `TOP` in the `SELECT` statement as in Microsoft's Transact-SQL, we use the keyword `LIMIT` and place it at the end of the query.
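To make the difference concrete, here is a minimal sketch that runs the SQLite form of the query from Python's built-in `sqlite3` module. The in-memory `tags` table is a trimmed, two-column stand-in for SEDE's table, filled with three of the rows shown above; it is an assumption made for illustration, not the real database.

```python
import sqlite3

# In-memory SQLite database with a trimmed stand-in for SEDE's Tags table
# (only two columns, three rows copied from the result table above).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (TagName TEXT, Count INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [("python", 6521), ("machine-learning", 11011), ("deep-learning", 4776)],
)

# SQLite counterpart of the Transact-SQL query
#   SELECT TOP 2 * FROM tags ORDER BY Count DESC;
# LIMIT goes at the end of the statement instead of after SELECT.
top_tags = conn.execute(
    "SELECT * FROM tags ORDER BY Count DESC LIMIT 2"
).fetchall()

print(top_tags)
conn.close()
```

The query still sorts by `Count` in descending order; only the position and keyword of the row cap change between the two dialects.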