# Popular Data Science Questions
---
## Introduction
### What is this project about

In this scenario, we work for a company that creates data science content: books, online articles, videos, or interactive text-based platforms like [Dataquest](https://www.dataquest.io/). As enthusiastic employees, we've decided to scour the internet in search of the answer to one question:
* **What is it that people want to learn about in data science?**

Thinking back to when we first started learning programming, it occurs to us that if we wanted to figure out what programming content to write, we could consult [Stack Overflow](https://stackoverflow.com/) and see what kind of content is most popular. After investigating Stack Overflow a little more, we found out that it is part of a network of question-and-answer websites called [Stack Exchange](https://stackexchange.com/).
![Image](https://dq-content.s3.amazonaws.com/469/se_sites.png)

---

Stack Exchange hosts a huge number of sites on a multitude of fields and subjects, including mathematics, physics, philosophy, and data science. Just what we are looking for!
We can also include data engineering as a point of interest. So here are some websites relevant to our goal:
* [Data Science](https://datascience.stackexchange.com/)
* [Cross Validated](https://stats.stackexchange.com/) (a statistics site)
* [Artificial Intelligence](https://ai.stackexchange.com/)
* [Mathematics](https://math.stackexchange.com/)
* [Stack Overflow](https://stackoverflow.com/) itself
* [Database Administrators](https://dba.stackexchange.com/)
* [Unix & Linux](https://unix.stackexchange.com/)
* [Software Engineering](https://softwareengineering.stackexchange.com/)

### A closer look at the Stack Exchange websites
Let's take **Stack Overflow** as an example and walk through it a bit, just to get familiar with it. The other sites work in a similar way.

Basically, **it is a Q&A website** where everyone is welcome to ask about a wide range of topics in computer programming. If someone runs into a problem, doesn't know exactly how a particular function works, or is struggling to get their code to work, they come here.

Because issues can relate to different programming languages, there is a **tagging system** that helps filter and navigate content. There is also a **voting system** that lets users rate the best questions and answers, so their authors can earn reputation points.

And finally, there is a **job finding section** where you can try to get yourself a fancy new job.

So the question is the main unit, followed by answers and comments. It has a score, a view count and (usually) tags.

![Image](https://upload.wikimedia.org/wikipedia/commons/d/dd/Stack_Overflow_Home.png)

---

### Stack Exchange Database

Stack Exchange provides a public database for each of its websites. Here's a link to query and explore [Data Science Stack Exchange's database](https://data.stackexchange.com/datascience/query/new). Queries are written in Transact-SQL.

There are a lot of interesting tables, such as:
* Posts
* Tags
* TagSynonyms
* Comments

## Getting started
### Stack Exchange Data Explorer
Stack Exchange Data Explorer, or SEDE for short, is a great tool, so why not use it? We can run a query right there and download the result as a CSV file for further use.

Let's start with the `Posts` table. We'll focus on the columns that seem relevant to our goal:

* **Id**: An identification number for the post.
* **PostTypeId**: An identification number for the type of post.
* **CreationDate**: The date and time of creation of the post.
* **Score**: The post's score.
* **ViewCount**: How many times the post was viewed.
* **Tags**: What tags were used.
* **AnswerCount**: How many answers the question got.
* **FavoriteCount**: How many times the question was favorited.

We will also limit ourselves to posts created in 2019.

To get that info from SEDE, we'll run the following query:
```
-- All posts created in 2019, with the post type name instead of its numeric id
SELECT p.Id, pt.Name AS PostType, p.CreationDate, p.Score,
       p.ViewCount, p.Tags, p.AnswerCount, p.FavoriteCount
  FROM Posts AS p
       LEFT JOIN PostTypes AS pt
       ON p.PostTypeId = pt.Id
 WHERE YEAR(p.CreationDate) = 2019;
```
Then we'll download the resulting CSV file and read it into a dataframe.

In [18]:
import pandas as pd

stack_2019_df = pd.read_csv("QueryResults.csv")
stack_2019_df.head()

| | Id | PostType | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|---|---|---|---|---|---|---|---|---|
| 0 | 49302 | Question | 2019-04-15 04:00:33 | 0 | 1818.0 | `<python><statistics>` | 1.0 | NaN |
| 1 | 49303 | Answer | 2019-04-15 04:34:33 | 4 | NaN | NaN | NaN | NaN |
| 2 | 49304 | Question | 2019-04-15 05:35:40 | 0 | 123.0 | `<logistic-regression>` | 1.0 | NaN |
| 3 | 49305 | Answer | 2019-04-15 05:43:03 | 0 | NaN | NaN | NaN | NaN |
| 4 | 49306 | Question | 2019-04-15 06:02:12 | 0 | 172.0 | `<nlp><chatbot>` | 1.0 | 1.0 |
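
As an optional shortcut, the date column can be parsed while the file is being read. The sketch below is only an illustration: it uses three rows copied from the output above, fed through `io.StringIO`, in place of the real `QueryResults.csv`, and the names `sample_csv` and `sample_df` are made up for this example.

```python
import io

import pandas as pd

# Three rows copied from the output above stand in for QueryResults.csv,
# so the sketch runs without the downloaded file.
sample_csv = io.StringIO(
    "Id,PostType,CreationDate,Score,ViewCount,Tags,AnswerCount,FavoriteCount\n"
    "49302,Question,2019-04-15 04:00:33,0,1818.0,<python><statistics>,1.0,\n"
    "49303,Answer,2019-04-15 04:34:33,4,,,,\n"
    "49306,Question,2019-04-15 06:02:12,0,172.0,<nlp><chatbot>,1.0,1.0\n"
)

# parse_dates asks read_csv to convert CreationDate to a datetime column.
sample_df = pd.read_csv(sample_csv, parse_dates=["CreationDate"])
print(sample_df.dtypes)
```

With the real file, the same `parse_dates` argument could be passed to `pd.read_csv("QueryResults.csv", ...)`.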


### Exploring and cleansing

Now we have a dataframe that we can explore. So let's see what we're dealing with!

Let's start with the overall info.

In [19]:
stack_2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14676 entries, 0 to 14675
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             14676 non-null  int64  
 1   PostType       14676 non-null  object 
 2   CreationDate   14676 non-null  object 
 3   Score          14676 non-null  int64  
 4   ViewCount      6773 non-null   float64
 5   Tags           6773 non-null   object 
 6   AnswerCount    6773 non-null   float64
 7   FavoriteCount  1656 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 917.4+ KB


There are a few issues, as we can see:
* The `CreationDate` column isn't of the `datetime` type. We should fix that.
* The `ViewCount`, `AnswerCount` and `FavoriteCount` columns are `float64` but should be `int64`, since they hold counts.
* The `ViewCount`, `AnswerCount` and `FavoriteCount` columns have a lot of missing values. We may find a way to fix that as well.
* The `Tags` column also has missing values, but since it isn't a count column, they won't be as easy to fix.
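
The missing values listed above can also be counted directly instead of being read off the `info()` output. Here is a minimal sketch on a made-up four-row dataframe (`demo_df` is hypothetical and only mimics the shape of our data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataframe above, for illustration only.
demo_df = pd.DataFrame({
    "PostType": ["Question", "Answer", "Question", "Answer"],
    "ViewCount": [1818.0, np.nan, 123.0, np.nan],
    "Tags": ["<python><statistics>", None, "<logistic-regression>", None],
    "AnswerCount": [1.0, np.nan, 1.0, np.nan],
    "FavoriteCount": [np.nan, np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# Share of missing values in each column, as a fraction between 0 and 1.
missing_share = demo_df.isna().mean()
print(missing_share)
```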

First things first: let's deal with the `CreationDate` column.

In [24]:
stack_2019_df['CreationDate'] = pd.to_datetime(stack_2019_df['CreationDate'])
stack_2019_df['CreationDate'].head()

0   2019-04-15 04:00:33
1   2019-04-15 04:34:33
2   2019-04-15 05:35:40
3   2019-04-15 05:43:03
4   2019-04-15 06:02:12
Name: CreationDate, dtype: datetime64[ns]

It's done!

Now let's proceed to the next step: the missing values in the count columns.

The `AnswerCount` column applies to questions only, which makes sense: nobody answers an answer, that's what comments are for.

Marking an answer or a comment as a favorite doesn't make much sense either, because the question is the main unit here, as we already know. So the `FavoriteCount` column also applies to questions only.

If that's right, both columns should be entirely `NaN` for every non-question post. Let's check.

In [42]:
stack_2019_df.loc[stack_2019_df['PostType'] != 'Question', ['PostType', 'AnswerCount', 'FavoriteCount']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7903 entries, 1 to 14673
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PostType       7903 non-null   object 
 1   AnswerCount    0 non-null      float64
 2   FavoriteCount  0 non-null      float64
dtypes: float64(2), object(1)
memory usage: 247.0+ KB


Yep, every value in these columns is `NaN` for non-question posts, just as we assumed. So it is reasonable to replace them with `0`.
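
The same check can be done for all post types at once by counting non-null values per group. This is only a sketch on a made-up dataframe (`demo_df` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataframe above, for illustration only.
demo_df = pd.DataFrame({
    "PostType": ["Question", "Answer", "Question", "Answer"],
    "AnswerCount": [1.0, np.nan, 1.0, np.nan],
    "FavoriteCount": [np.nan, np.nan, 1.0, np.nan],
})

# count() ignores NaN, so each cell is the number of non-null values
# for that post type and column.
non_null_by_type = (
    demo_df.groupby("PostType")[["AnswerCount", "FavoriteCount"]].count()
)
print(non_null_by_type)
```

A row of zeros for a post type means the columns hold no values at all for that type.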

The `ViewCount` column probably uses `NaN` in place of `0` as well. Let's check that too.

In [36]:
stack_2019_df['ViewCount'].value_counts(dropna=False).sort_index()

8.0            2
9.0            5
10.0           5
11.0           9
12.0           8
            ... 
92580.0        1
106605.0       1
111240.0       1
152559.0       1
NaN         7903
Name: ViewCount, Length: 1899, dtype: int64

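There are no `0` values at all, and the number of `NaN`s (7903) equals the number of non-question posts we counted above. So, as we assumed, it is reasonable to treat `NaN` as `0` here too.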
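
To see which post types the missing view counts belong to, one option is a cross-tabulation of post type against missingness. The sketch below runs on a made-up dataframe, and the labels `"missing"` and `"present"` are chosen just for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataframe above, for illustration only.
demo_df = pd.DataFrame({
    "PostType": ["Question", "Answer", "Question", "Answer"],
    "ViewCount": [1818.0, np.nan, 123.0, np.nan],
})

# Label each row by whether its view count is missing.
view_status = demo_df["ViewCount"].isna().map(
    {True: "missing", False: "present"}
)

# Rows: post type. Columns: missing / present. Cells: number of posts.
view_table = pd.crosstab(demo_df["PostType"], view_status)
print(view_table)
```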

Now we can replace all `NaN`s with `0` and change the type to `int64` in one step.

In [53]:
stack_2019_df[['ViewCount','AnswerCount', 'FavoriteCount']] = stack_2019_df[
    ['ViewCount', 'AnswerCount', 'FavoriteCount']].fillna(0).astype('int64')

stack_2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14676 entries, 0 to 14675
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             14676 non-null  int64         
 1   PostType       14676 non-null  object        
 2   CreationDate   14676 non-null  datetime64[ns]
 3   Score          14676 non-null  int64         
 4   ViewCount      14676 non-null  int64         
 5   Tags           6773 non-null   object        
 6   AnswerCount    14676 non-null  int64         
 7   FavoriteCount  14676 non-null  int64         
dtypes: datetime64[ns](1), int64(5), object(2)
memory usage: 917.4+ KB
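
Filling with `0` is the approach taken here. If we ever wanted to keep the difference between "zero" and "not applicable", pandas' nullable `Int64` dtype would be a possible alternative. The sketch below compares the two on a made-up dataframe (`demo_counts` is hypothetical); it is not what the rest of this notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical count columns with missing values, for illustration only.
demo_counts = pd.DataFrame({
    "ViewCount": [1818.0, np.nan, 123.0],
    "AnswerCount": [1.0, np.nan, 1.0],
})

# Approach used above: missing values become 0 and the dtype becomes int64.
filled = demo_counts.fillna(0).astype("int64")
print(filled.dtypes)

# Alternative: the nullable integer dtype keeps missing values as <NA>.
nullable = demo_counts.astype("Int64")
print(nullable)
```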


### Tags