# Popular Data Science Questions

## Introduction

In this scenario, we're working for a company that creates data science content.

Our goal is to explore [Data Science Stack Exchange](https://datascience.stackexchange.com/) and use it to determine what content a data science education company should create.

## Stack Exchange

Stack Exchange hosts question-and-answer sites on a variety of topics and disciplines, such as math, languages, and philosophy. Each site covers a specific topic, and its questions, answers, and users are subject to a reputation process that allows the site to moderate itself.

Data Science Stack Exchange (DSSE) is a site for Data Science professionals, Machine Learning specialists and those interested in furthering their knowledge in the field.

### What kind of questions are welcome on this site?

In the questions section of [DSSE's help center](https://datascience.stackexchange.com/help), we learn that we should:
- ask practical, answerable questions about Data Science (there are more appropriate sites for theoretical questions),
- ask questions that are reasonably scoped,
- avoid asking subjective questions, unless they are constructive, meaning that they:
    - inspire answers that explain “why” and “how”,
    - have a constructive, fair, and impartial tone,
    - invite sharing experiences over opinions,
    - insist that opinion be backed up with facts and references,
- ask specific questions,
- make questions relevant to others.

All of these characteristics should be helpful for our goal.

In the help center we also learned that three other sites are relevant:
- [Open Data](https://opendata.stackexchange.com/help/on-topic) (Dataset requests)
- [AI](https://ai.stackexchange.com/help/on-topic) (Artificial Intelligence concepts and social implications)
- [Computational Science](https://scicomp.stackexchange.com/help/on-topic) (Software packages and algorithms in applied mathematics)


### What, other than questions, does the site's [home](https://datascience.stackexchange.com/) subdivide into?

On the home page we can see that we have five sections:
- Questions: a list of all questions asked;
- Tags: a list of tags (keywords or labels that categorize questions);
- Users: a list of users;
- Companies: a list of registered companies;
- Unanswered: a list of unanswered questions.

In the right panel, we can find a list of hyperlinked "hot" network questions, which shows what is trending across the whole Stack Exchange network.

The tagging system used by Stack Exchange allows us to quantify how many questions are asked about each subject.

By exploring the help center, we can also learn that Stack Exchange's sites are heavily moderated by the community.

### What information is available in each post?

Looking, for example, at [this question](https://datascience.stackexchange.com/questions/66059/how-to-sort-a-multi-level-pandas-data-frame-by-a-particular-column), some of the information we see is:
- For both questions and answers:
    - the post's score;
    - the post's title;
    - the post's author;
    - the post's body;
- For questions only:
    - the last time the question was active;
    - how many times the question was viewed;
    - related questions;
    - the question's tags.

## Stack Exchange Data Explorer

Stack Exchange Data Explorer (SEDE) is a web tool that gives access to a public database for each of Stack Exchange's websites. We can query and explore Data Science Stack Exchange's database [here](https://data.stackexchange.com/datascience/query/new).

After reviewing Data Science Stack Exchange's database, a few tables look relevant for our goal:
- Posts
- PostTags
- Tags
- TagSynonyms

Running a few exploratory queries leads us to focus on the `Posts` table for now and to extend the analysis to other tables if needed. We also decide to restrict our data to 2019, supplementing it with data from other years as the need arises.

### Getting the Data

We first went to the SEDE query page for Data Science Stack Exchange and ran the following query.
```
-- Questions only (PostTypeId = 1) created during 2019
SELECT Id, CreationDate,
       Score, ViewCount, Tags,
       AnswerCount, FavoriteCount
  FROM Posts
 WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;
```
The resulting data was downloaded and saved as `2019_questions.csv`.
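
To make the two conditions in the `WHERE` clause concrete, here is a minimal pandas sketch that applies the same filter to a small table. The `posts` DataFrame and its rows are made up for illustration. The actual filtering was done in SEDE, not in pandas.

```python
import pandas as pd

# Hypothetical sample of the Posts table (all values are made up).
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PostTypeId": [1, 2, 1, 1],  # 1 = question, 2 = answer
    "CreationDate": pd.to_datetime([
        "2018-12-31 23:59:59",
        "2019-03-01 10:00:00",
        "2019-01-01 00:00:00",
        "2019-12-31 23:59:59",
    ]),
})

# Same conditions as the WHERE clause: questions only, created in 2019.
is_question = posts["PostTypeId"] == 1
is_2019 = posts["CreationDate"].dt.year == 2019
questions_2019 = posts[is_question & is_2019]

print(questions_2019["Id"].tolist())  # [3, 4]
```

Post 1 is dropped because it was created in 2018, and post 2 because it is an answer rather than a question.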

### Exploring the Data

In [1]:
# Importing all needed libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Reading dataset
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])

In [4]:
questions

|   | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|---|---|---|---|---|---|---|---|
| 0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | `<machine-learning><data-mining>` | 0 | |
| 1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | `<machine-learning><regression><linear-regressi...` | 0 | |
| 2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | `<python><time-series><forecast><forecasting>` | 0 | |
| 3 | 44427 | 2019-01-23 10:57:09 | 0 | 55 | `<machine-learning><scikit-learn><pca>` | 1 | |
| 4 | 44428 | 2019-01-23 11:02:15 | 0 | 19 | `<dataset><bigdata><data><speech-to-text>` | 0 | |
| 5 | 44430 | 2019-01-23 11:13:32 | 0 | 283 | `<fuzzy-logic>` | 1 | |
| 6 | 44432 | 2019-01-23 11:17:46 | 1 | 214 | `<time-series><anomaly-detection><online-learning>` | 0 | 1.0 |
| 7 | 44436 | 2019-01-23 12:49:39 | 0 | 9 | `<matrix-factorisation>` | 0 | |
| 8 | 44437 | 2019-01-23 13:04:11 | 0 | 7 | `<correlation><naive-bayes-classifier>` | 0 | |
| 9 | 44438 | 2019-01-23 13:16:29 | 0 | 584 | `<machine-learning><python><deep-learning><kera...` | 1 | |


Running `questions.info()` gives us a lot of useful information.

In [5]:
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


From the output above we can see that our dataset contains 8839 rows and 7 columns.

The types seem adequate for every column. 

We can also see that only `FavoriteCount` has missing values. A missing value in this column probably means that the question is not on any user's favorites list, so we can replace the missing values with zero and change the dtype to integer, since there is no reason to store these counts as floats.
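
As a minimal sketch of that idea, the snippet below fills missing values with zero and casts the column to an integer dtype. The `sample` DataFrame is a made-up stand-in for `questions`, and the zero fill rests on the assumption stated above that a missing value means no favorites.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: some FavoriteCount values are missing.
sample = pd.DataFrame({"FavoriteCount": [np.nan, 1.0, np.nan, 3.0]})

# Count missing values per column before cleaning.
print(sample.isna().sum())

# Assumption: a missing count means zero favorites. Fill, then store as integers.
sample["FavoriteCount"] = sample["FavoriteCount"].fillna(0).astype(int)

print(sample["FavoriteCount"].tolist())  # [0, 1, 0, 3]
```

The cast has to come after the fill, because a column that still contains `NaN` cannot be converted to a plain integer dtype.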

`Tags` is the only column with the `object` dtype. Since `object` is a catch-all type, let's check what types the values in `questions["Tags"]` actually are.

In [9]:
questions['Tags'].apply(type).unique()

array([<class 'str'>], dtype=object)

We see that every value in this column is a string. On Stack Exchange, each question can have at most five tags (source), so we'll store each question's tags as a list.
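
One possible way to turn these strings into lists is sketched below, using a regular expression that captures the text between each pair of angle brackets. The tag strings are copied from the preview above. The `tags` and `tag_lists` names are illustrative, and this is only one of several approaches that would work.

```python
import pandas as pd

# Tag strings taken from the preview of the dataset.
tags = pd.Series([
    "<machine-learning><data-mining>",
    "<python><time-series><forecast><forecasting>",
    "<fuzzy-logic>",
])

# Capture everything between each "<" and ">" as one tag.
tag_lists = tags.str.findall(r"<([^<>]+)>")

print(tag_lists.tolist())
# [['machine-learning', 'data-mining'],
#  ['python', 'time-series', 'forecast', 'forecasting'],
#  ['fuzzy-logic']]

# Largest number of tags on any question in this sample.
print(tag_lists.str.len().max())  # 4
```

Keeping the tags as a list per question makes it straightforward to count how often each tag appears across questions later on.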

### Cleaning the Data