# Popular Data Science Questions

The ACME Corporation has recently decided to create books, online articles, and videos in the field of data science, and is looking for ideas for new content.  This analysis will help us determine which content to start with.  To do so, we will analyze the most popular questions on Stack Exchange, an online question-and-answer forum.

## Stack Exchange

Stack Exchange is a web site where users can ask questions and answer questions posted by other users.  The site is organized into various topics, such as Data Science.

The site allows users to rate the quality of the questions asked.  The higher rated questions appear near the top of the list of questions to highlight quality questions.  Users can also rate the answers and the highest rated answer is highlighted after the question.

The [Data Science](https://datascience.stackexchange.com) Stack Exchange web site allows questions and answers for Data Science professionals, Machine Learning specialists, and individuals who wish to learn more about data science.

### What kinds of questions are accepted?

According to the Data Science Stack Exchange help center's [asking questions FAQ](https://datascience.stackexchange.com/help/asking), the following guidelines should be followed:
* Avoid asking subjective questions
* Avoid asking duplicate questions (duplicate questions are flagged accordingly)
* Ask only relevant and appropriate questions with regard to data science
    * For instance, programming questions that make no reference to data should be asked on [Stack Overflow](https://stackoverflow.com) instead of the Data Science Stack Exchange.
* Ask specific questions
* Make questions relevant to others

All of these characteristics should be helpful in our analysis.

The help center's page also refers to several sites that may be of relevance to our analysis:
* [Open Data](https://opendata.stackexchange.com/) for dataset requests
* [Computational Science](https://scicomp.stackexchange.com/) for software packages and algorithms in applied mathematics

### How are questions subdivided?

On the [home](https://datascience.stackexchange.com/) page, there are four sections:
* [Questions](https://datascience.stackexchange.com/questions) - list of all questions asked
* [Tags](https://datascience.stackexchange.com/tags) - list of tags (keywords or labels that categorize questions). Users can also assign "Tags" to each question to make it easier to search for and organize questions by topic.
* [Users](https://datascience.stackexchange.com/users) - list of registered users on Stack Exchange
* [Unanswered](https://datascience.stackexchange.com/unanswered) - list of unanswered questions

The tagging system used by Stack Exchange could be helpful in our analysis as it will allow us to quantify how many questions were asked about a particular subject.

Questions on Stack Exchange sites are heavily moderated by the community, which gives us confidence in the validity of the tagging system and, in turn, in the conclusions we draw from it.

### What information is available in each post?

For both questions and answers:
* Rating or score (a numerical value; the higher the number, the higher the rating)
* Title
* Author
* Body of post

For questions only:
* Last time the question was active
* How many times the question was viewed
* Related questions
* Tags associated with the question

## Stack Exchange Data Explorer

Stack Exchange provides a public interface to query the data for any of its web sites.  The [Stack Exchange Data Explorer](https://data.stackexchange.com/datascience/query/new) provides a means to query and explore the Data Science database.

The Stack Exchange Data Explorer exposes a few tables that may be relevant to our analysis:
* Posts - contains the details of each post
* PostTags - associates tags with a particular post
* Tags - contains each tag along with the number of times it has been used
* TagSynonyms - contains the different variations of a particular tag

## Getting the Data

Below are some definitions of the fields in the Posts table that are relevant to our analysis:

* Id : identifier for the post
* PostTypeId : identifier for the type of post
* CreationDate : date and time the post was created
* Score : post's score
* ViewCount : number of times the post was viewed
* Tags : tags used to describe the content of the post
* AnswerCount : number of answers associated with the question post
* FavoriteCount : number of times the question post was favorited

We are only interested in posts made in 2019, so that we can analyze current trends in data science questions.  Let's retrieve the data where the CreationDate year is 2019 and PostTypeId = 1 (1 refers to the question type):

```sql
SELECT Id, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM posts
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;
```

Below is a sample of the data returned:

| Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|:--:|:------------:|:-----:|----------:|------|:-----------:|:-------------:|
| 53985 | 2019-06-18 05:07:54 | 1 | 40 | `<logistic-regression><orange><orange3>` | 0 | 0 |
| 53987 | 2019-06-18 06:37:06 | 0 | 13 | `<neural-network><natural-language-process><ngrams>` | 0 | |
| 53989 | 2019-06-18 06:53:20 | 0 | 26 | `<machine-learning><svm><active-learning>` | 0 | 1 |
| 53990 | 2019-06-18 06:57:15 | 0 | 25 | `<machine-learning><feature-selection><data-cleaning><correlation>` | 0 | |
| 53991 | 2019-06-18 06:59:37 | 1 | 138 | `<clustering>` | 1 | |



## Exploring the Data

For the purposes of this analysis, we have exported the above query into the file "2019_questions.csv".

Let's load this data into a dataframe so we can manipulate and visualize the data in Python:

In [1]:
# Let's import the Python libraries we will use:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])

Let's run [questions.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) and see if it provides any quick insight into the data:

In [3]:
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             8839 non-null   int64         
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64         
 3   ViewCount      8839 non-null   int64         
 4   Tags           8839 non-null   object        
 5   AnswerCount    8839 non-null   int64         
 6   FavoriteCount  1407 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB


Only 1407 of a possible 8839 values are filled in for FavoriteCount.  It's probably safe to treat a null value in this field as zero.  FavoriteCount is also stored as a float; since the number of favorites should only take integer values, we'll change the data type from float to integer.
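Before casting, the whole-number assumption can be checked rather than taken on faith. The following is a minimal sketch on a small made-up Series (the values are hypothetical stand-ins, not rows from the real export): it confirms that every recorded value is integral and only then fills the gaps with zero and converts to `int`.

```python
import pandas as pd

# Hypothetical stand-in for the FavoriteCount column; NaN marks a missing value.
favorites = pd.Series([1.0, None, 3.0, None, 2.0], name="FavoriteCount")

# Check the assumption that every recorded value is a whole number.
recorded = favorites.dropna()
all_whole = bool((recorded % 1 == 0).all())
if not all_whole:
    raise ValueError("FavoriteCount contains non-integer values")

# The assumption holds, so treat missing values as zero and cast to int.
favorites_clean = favorites.fillna(0).astype(int)

print(all_whole)
print(favorites_clean.tolist())
```

If the check fails, the cast would silently truncate fractional values, so raising an error is the more cautious choice here.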

The data type for Tags shows up as an "object."  Let's investigate further as to what data type the Tags field actually is:

In [4]:
questions["Tags"].apply(lambda x: type(x)).unique()

array([<class 'str'>], dtype=object)

The Tags column holds strings.  On Stack Exchange, each question can have a maximum of five tags ([source](https://meta.stackexchange.com/a/18879)).  Creating a separate field for each of the five tags would be cumbersome, so we'll keep all of a question's tags together in the single Tags field as a list.
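The five-tag limit can be checked against the data itself. This is a rough sketch that assumes the raw Tags strings wrap each tag in angle brackets (for example `<python><pandas>`); the sample strings below are invented for illustration and are not taken from the export.

```python
import pandas as pd

# Hypothetical raw tag strings in the assumed angle-bracket format.
raw_tags = pd.Series([
    "<machine-learning><svm><active-learning>",
    "<clustering>",
    "<python><pandas><numpy><scikit-learn><keras>",
])

# Each tag contributes exactly one "<", so counting them gives tags per question.
tags_per_question = raw_tags.str.count("<")

print(tags_per_question.tolist())
print(int(tags_per_question.max()))
```

Running the same count on `questions["Tags"]` before any cleaning would show whether any question exceeds five tags.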

## Cleaning the Data

Let's begin by filling in the missing values in the FavoriteCount field and converting it to an integer:

In [6]:
questions.fillna(value={"FavoriteCount" : 0}, inplace=True)
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             8839 non-null   int64         
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64         
 3   ViewCount      8839 non-null   int64         
 4   Tags           8839 non-null   object        
 5   AnswerCount    8839 non-null   int64         
 6   FavoriteCount  8839 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB


Now, let's modify the Tags field to make it easier to work with.  Each tag in the Tags field is enclosed in angle brackets, with no spacing between tags.  We'll remove the outer brackets, split on the inner ones, and save the resulting tags as a Python list object.

In [7]:
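# regex=True is passed explicitly: the default for str.replace differs between pandas versions
questions["Tags"] = questions["Tags"].str.replace(r"^<|>$", "", regex=True).str.split("><")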
questions.sample(5)

| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|--:|:--:|:------------:|:-----:|----------:|------|:-----------:|:-------------:|
| 1649 | 57443 | 2019-08-12 16:30:25 | 2 | 149 | [machine-learning, python, feature-selection, ... | 1 | 0 |
| 5751 | 63303 | 2019-11-17 19:17:10 | 0 | 24 | [scikit-learn, tensorflow, pytorch, apache-spark] | 0 | 0 |
| 993 | 45553 | 2019-02-14 03:24:29 | 1 | 300 | [xgboost, ranking, ndcg] | 0 | 0 |
| 2648 | 47242 | 2019-03-13 15:37:47 | 1 | 39 | [r] | 1 | 0 |
| 8110 | 54680 | 2019-06-28 10:04:05 | 0 | 35 | [machine-learning, neural-network, deep-learni... | 0 | 0 |
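The bracket-stripping step can be tried on a few hand-written strings before trusting it on the full dataset. The sketch below uses a tiny hypothetical DataFrame (the rows are made up) and passes `regex=True` explicitly so the pattern is treated as a regular expression regardless of the pandas version's default. It also shows one possible way to count tag usage once the tags are lists, using `explode` and `value_counts`.

```python
import pandas as pd

# Hypothetical rows in the raw angle-bracket format; not real data from the export.
raw = pd.DataFrame({"Tags": ["<machine-learning><python>", "<r>", "<python><pandas>"]})

# Strip the leading "<" and trailing ">", then split on the "><" between tags.
tag_lists = (
    raw["Tags"]
    .str.replace(r"^<|>$", "", regex=True)
    .str.split("><")
)
print(tag_lists.tolist())

# With one list per question, explode gives one row per tag, ready for counting.
tag_counts = tag_lists.explode().value_counts()
print(tag_counts["python"])
```

Storing the tags as lists keeps a single Tags column while still allowing per-tag aggregation.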


## Most Used and Most Viewed

