# Popular Data Science Questions
In this project, we assume the role of Data Analysts working in a company that creates Data Science content. The content can be in the form of books, online articles, videos or interactive text. We have been tasked with figuring out the best content to write about for maximum user engagement.

To address this task, we decide to collect data from **[Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange)**, a network of question-and-answer websites. Stack Exchange hosts sites on a multitude of fields and subjects, including Mathematics, Physics, Statistics, and Data Science. Our rationale is simple: if we can identify the Data Science questions that people commonly ask, then we can tailor our content to address those topics.

## Why Stack Exchange?
---
1. **Relevance:** Data Science is a multidisciplinary field, and Stack Exchange provides several websites that are relevant to our goal. Examples include:
>- [Data Science](https://datascience.stackexchange.com/)
>- [Cross Validated](https://stats.stackexchange.com/) — a statistics site
>- [Artificial Intelligence](https://ai.stackexchange.com/)
>- [Mathematics](https://math.stackexchange.com/)
>- [Stack Overflow](https://stackoverflow.com/)
  
2. **It is easy to identify good posts**: Stack Exchange uses a reputation award system for its questions and answers. Each post is subject to upvotes and downvotes. This makes it easy to identify posts that users engage with and value.
3. **Established posting guidelines**: Stack Exchange's [Data Science help center](https://datascience.stackexchange.com/help/asking) states that questions should be objective, practical about Data Science, specific and relevant to other users.

The combination of these attributes makes Stack Exchange a good source of data for our needs.

## Data Science Stack Exchange
---
Unlike other Stack Exchange sites, Data Science Stack Exchange (DSSE) is dedicated specifically to Data Science. DSSE also has a high percentage of unanswered questions. A complete list of Stack Exchange websites, sorted by the proportion of answered questions, can be found [here](https://stackexchange.com/sites?view=list#percentanswered). At the time of this writing, DSSE is one of the bottom 10 sites, with only 65% of its questions answered. These qualities make DSSE the ideal candidate for our investigation.

_**The DSSE homepage**_<br><br>
<img src='./images/dsse_site.png'>

The left navigation bar on the DSSE homepage comprises five options:
>- [Questions](https://datascience.stackexchange.com/questions) — a list of all questions on the site. Each question contains information on the number of votes, views and answers provided, among many others.
>- [Tags](https://datascience.stackexchange.com/tags) — a list of keywords or labels that categorize questions.<br><br>
<img src='./images/dsse_tags.png'><br><br>
>- [Users](https://datascience.stackexchange.com/users) — a list of users.<br><br>
<img src='./images/dsse_users.png'><br><br>
>- [Companies](https://stackoverflow.com/jobs/companies) — a list of companies hiring tech professionals.<br><br>
<img src='./images/dsse_companies.png'><br><br>
>- [Unanswered](https://datascience.stackexchange.com/unanswered) — a list of unanswered questions.
  
After exploring the website, it is clear that the **tags** will be very useful in categorizing content, saving us the trouble of doing it ourselves. The tags can also help us identify how many questions are asked about each subject.

## SEDE: The Stack Exchange Data Explorer
---
Stack Exchange provides a public database for each of its websites. We can use [this link](https://data.stackexchange.com/datascience/query/new) to query and explore the Data Science Stack Exchange database for information on posts. It is important to note that the database is designed to be queried with the [Transact-SQL (Microsoft's SQL)](https://docs.microsoft.com/en-us/sql/t-sql/language-reference?view=sql-server-ver16) dialect.

After exploring the database, we found a few tables that seem relevant to our analysis:
>- **Posts:** Contains comprehensive information about posts, including the creation date, tags, number of answers, views and upvotes among many more.
>- **Tags:** Holds information about different tags including the number of times they have been used on the site. However, it does not provide time-series information to help us identify if a tag was popular in the past or present.
>- **PostTags:** Contains information on posts and their tags alone. Similar to the Tags table, time series information is absent.
>- **TagSynonyms:** Provides information on tags and alternative names that have been assigned to them by site administrators. Time series information is absent.

Given the absence of time-series information in the **Tags**, **PostTags** and **TagSynonyms** tables, and considering that the **Posts** table already contains the relevant information about tags, we will use the information in the **Posts** table alone.
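
To illustrate why the **Posts** table can be sufficient on its own, here is a minimal Python sketch that combines a post's creation date with its tags to count tag usage per year. The sample rows are made up for illustration, and the `<tag-one><tag-two>` string format for the Tags column is an assumption; adjust the pattern if the exported data uses a different delimiter.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical rows shaped like (CreationDate, Tags) from the Posts table.
# The "<tag><tag>" format is an assumption, not confirmed by the text above.
sample_posts = [
    ("2021-03-14 09:30:00", "<python><pandas>"),
    ("2021-11-02 17:05:00", "<python><deep-learning>"),
    ("2022-01-20 12:00:00", "<deep-learning>"),
]


def parse_tags(tag_string):
    """Split a string like '<a><b>' into the list ['a', 'b']."""
    return re.findall(r"<([^<>]+)>", tag_string)


def tag_counts_by_year(posts):
    """Count how often each tag appears in each creation year."""
    counts = Counter()
    for created, tags in posts:
        year = datetime.strptime(created, "%Y-%m-%d %H:%M:%S").year
        for tag in parse_tags(tags):
            counts[(year, tag)] += 1
    return counts


yearly_counts = tag_counts_by_year(sample_posts)
for (year, tag), n in sorted(yearly_counts.items()):
    print(year, tag, n)
```

Because each row carries both a timestamp and its tags, the yearly breakdown that the other three tables cannot provide falls out of a single pass over the posts.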

## The Posts Table
---
The Posts Table has **23 columns**. We will focus only on those that seem relevant to our goal:

> (1.) **Id:** An identification number for the post.<br>
> (2.) **PostTypeId:** An identification number for the type of post. The **eight** unique post types are shown below:<br><br>
<img src='./images/post_types.png'><br><br>
> (3.) **CreationDate:** The date and time of creation of the post.<br>
> (4.) **Score:** The post's score.<br>
> (5.) **ViewCount:** How many times the post was viewed.<br>
> (6.) **Tags:** What tags were used.<br>
> (7.) **AnswerCount:** How many answers the question got (only applicable to question posts).<br>
> (8.) **FavoriteCount:** How many times the question was favorited.

We are primarily interested in posts that are questions. Other post types are not relevant at the moment. Before proceeding, we can check how many posts on the site are questions, relative to other posts.

_**The query below (run on Tuesday, August 23, 2022):**_

```
       -- Count posts per post type, most common first
       SELECT PostTypes.Name AS post_type,
              COUNT(*) AS num_posts
         FROM Posts
        INNER JOIN PostTypes
           ON Posts.PostTypeId = PostTypes.Id
        GROUP BY PostTypes.Name
        ORDER BY num_posts DESC;
```

_**Yields the following result:**_<br>
<img src='./images/num_posts.png'
     height = 300
     width = 300/>
     
Posts that are neither questions nor answers are so few that they are mostly inconsequential. The site has **34,118** question posts. Since we're only interested in recent posts, we will limit our analysis to **question posts from January 2021 to date (August 23, 2022)**.
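
The selection rule above can be sketched in Python as follows. The rows are hypothetical, and `QUESTION_TYPE_ID = 1` is an assumption about how the PostTypes table numbers question posts; confirm the value against the post types listing before relying on it.

```python
from datetime import datetime

QUESTION_TYPE_ID = 1  # assumed Id for "Question" in PostTypes; verify first
WINDOW_START = datetime(2021, 1, 1)
WINDOW_END = datetime(2022, 8, 23, 23, 59, 59)  # the day the query was run

# Hypothetical (Id, PostTypeId, CreationDate) rows, for illustration only.
sample_rows = [
    (101, 1, datetime(2020, 12, 31, 23, 0)),  # question, too old
    (102, 1, datetime(2021, 1, 1, 0, 0)),     # question, inside the window
    (103, 2, datetime(2021, 6, 15, 8, 30)),   # inside the window, not a question
    (104, 1, datetime(2022, 8, 23, 10, 0)),   # question, inside the window
]


def is_recent_question(post_type_id, created):
    """True when the post is a question created inside the analysis window."""
    return (
        post_type_id == QUESTION_TYPE_ID
        and WINDOW_START <= created <= WINDOW_END
    )


recent_question_ids = [
    post_id
    for post_id, type_id, created in sample_rows
    if is_recent_question(type_id, created)
]
print(recent_question_ids)
```

Both conditions are applied together, so an answer posted in 2021 and a question posted in 2020 are each excluded.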

## Getting the Data
---