# Popular Data Science Questions

#### Our goal in this project is to use [Data Science Stack Exchange](https://datascience.stackexchange.com/) to determine what content a data science education company should create, based on interest by subject.

## 1. Stack Exchange

### a. What is Stack Exchange?

![Stack Exchange](./Images/se_logo.png)

Stack Exchange is a network of community-driven question-and-answer websites covering a wide range of topics. It was created in 2008 by Joel Spolsky and Jeff Atwood with the goal of providing a platform for people to share knowledge and help others solve problems.

The Stack Exchange network is made up of individual websites, each dedicated to a particular topic or field. Some of the most popular sites include Stack Overflow, which focuses on programming and software development; Super User, which covers computer hardware and software; and Mathematics Stack Exchange, which is devoted to mathematics.

Users can ask and answer questions, vote on the quality of questions and answers, and earn reputation points for their contributions. The community is moderated by a team of volunteers who help keep the content accurate, useful, and free from spam and abuse. The Stack Exchange network has become a valuable resource for people seeking answers to technical or specialized questions, and has a reputation for providing high-quality, reliable information.

### b. How does Stack Exchange help us?

- Provides a platform for sharing knowledge.
- Helps solve problems.
- Provides high-quality, reliable information.
- Builds reputation and expertise.

## 2. Data Science Stack Exchange

### a. What is Data Science Stack Exchange?

Data Science Stack Exchange is one of the many websites in the Stack Exchange network, and provides a platform for people to ask and answer questions related to `data science`, `statistics`, `data analysis`, and `machine learning`.

![dsse_logo](./Images/dsse_logo.png)

## 3. Data Science Stack Exchange Home Page Explorer

On the [home page](https://datascience.stackexchange.com/) we can see that we have four sections:

- [Questions](https://datascience.stackexchange.com/questions) - a list of all questions asked;

![Questions](./Images/dsse_questions.png)

- [Tags](https://datascience.stackexchange.com/tags) — a list of tags (keywords or labels that categorize questions);

![Tags](./Images/dsse_tags.png)

- [Users](https://datascience.stackexchange.com/users) — a list of users;

![Users](./Images/dsse_users.png)

- [Unanswered](https://datascience.stackexchange.com/unanswered) — a list of unanswered questions;

![Unanswered](./Images/dsse_unanswered.png)

## 4. Data Science Stack Exchange Posts Explorer

Each post consists of one question and any answers it has received.

Sample Question      |  Sample Answer
:-------------------------:|:-------------------------:
![Sample Question](./Images/dsse_sample_question.png)  |  ![Sample Answer](./Images/dsse_sample_answer.png)

Looking at this example, some of the information we see is:

- For both questions and answers:
    - The post's score;
    - The post's title;
    - The post's author;
    - The post's body;
- For questions only:
    - How many users have it on their "
    - The last time the question was active;
    - How many times the question was viewed;
    - Related questions;
    - The question's tags;

---

## Data Science Stack Exchange Data Explorer

### 1. Getting the Data

Stack Exchange provides a public database for each of its websites. [Here](https://data.stackexchange.com/datascience/query/new) is a link to query and explore the Data Science Stack Exchange database.

To get the relevant data we run the following query, which keeps only questions (`PostTypeId = 1`) and the columns we need.

```sql
SELECT Id, CreationDate,
       Score, ViewCount, Tags,
       AnswerCount, FavoriteCount
  FROM posts
 WHERE PostTypeId = 1
```

Here's what the first few rows look like:

|   Id  	|   CreationDate   	| Score 	| ViewCount 	|                             Tags                             	| AnswerCount 	| FavoriteCount 	|
|:-----:	|:----------------:	|:-----:	|:---------:	|:------------------------------------------------------------	|:-----------:	|:-------------:	|
| 87185 	| 26/12/2020 23:43 	|   0   	|    105    	| `<logistic-regression><numpy>`                                 	|      2      	|               	|               	
| 87186 	| 27/12/2020 01:41 	|   0   	|    3160   	| `<python><deep-learning><nlp><pytorch><bert>`                  	|      3      	|               	|               	
| 87187 	| 27/12/2020 02:29 	|   0   	|     64    	| `<machine-learning><neural-network><learning-rate>`            	|      1      	|               	|               	
| 87188 	| 27/12/2020 03:29 	|   1   	|     96    	| `<machine-learning><deep-learning><nlp><transformer><chatbot>` 	|      2      	|               	|               	
| 87189 	| 27/12/2020 03:50 	|   1   	|     16    	| `<nlp><text-classification><.net>`                             	|      0      	|               	|               	
| 87190 	| 27/12/2020 03:50 	|   0   	|     43    	| `<tensorflow><game>`                                           	|      1      	|               	|               	
| 87193 	| 27/12/2020 05:44 	|   1   	|    390    	| `<scikit-learn><correlation><pca><variance><matrix>`           	|      1      	|               	|               	
| 87194 	| 27/12/2020 06:42 	|   0   	|     90    	| `<python><keras><scikit-learn><feature-scaling>`               	|      2      	|               	|               	
| 87195 	| 27/12/2020 08:38 	|   5   	|    343    	| `<classification><clustering><difference>`                     	|      3      	|               	|               	
| 87198 	| 27/12/2020 09:36 	|   2   	|    364    	| `<clustering><python-3.x><topic-model>`                        	|      1      	|               	|               	
| 87202 	| 27/12/2020 12:51 	|   0   	|    227    	| `<python><pandas>`                                             	|      1      	|               	|               	
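
The `Tags` column stores every tag of a question in a single string such as `<python><pandas>`. To measure interest by subject, the tags will likely need to be split apart and counted. The block below is a minimal sketch of one way to do that, using only the standard library. The helper name `parse_tags` and the variable names are illustrative assumptions, not part of the project, and the input strings are copied from a few of the sample rows above.

```python
import re
from collections import Counter


def parse_tags(raw):
    """Split a string such as '<python><pandas>' into ['python', 'pandas']."""
    # Capture everything between each pair of angle brackets.
    return re.findall(r"<([^<>]+)>", raw)


# Tag strings copied from a few of the sample rows shown above.
rows = [
    "<logistic-regression><numpy>",
    "<python><deep-learning><nlp><pytorch><bert>",
    "<nlp><text-classification><.net>",
    "<python><pandas>",
]

# Count how many questions use each tag.
tag_counts = Counter(tag for raw in rows for tag in parse_tags(raw))
print(tag_counts.most_common(3))
```

The same function could be applied to the full column once the data is loaded into pandas, for example with `questions["Tags"].apply(parse_tags)`.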

### 2. Exploring the Data

```python
# We import everything that we'll use

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
```

```python
questions = pd.read_csv("all_questions.csv", parse_dates=["CreationDate"])
```

Running `questions.info()` gives a lot of useful information.

```python
questions.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35773 entries, 0 to 35772
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             35773 non-null  int64
 1   CreationDate   35773 non-null  datetime64[ns]
 2   Score          35773 non-null  int64
 3   ViewCount      35773 non-null  int64
 4   Tags           35773 non-null  object
 5   AnswerCount    35773 non-null  int64
 6   FavoriteCount  587 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 1.9+ MB
```
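
The `info()` output shows that `FavoriteCount` is the only column with missing values (587 non-null out of 35773) and that it is stored as `float64`. The block below is a rough sketch of one way to handle this. It assumes that a missing value means nobody favorited the question, which the data itself does not confirm. It runs on a tiny made-up frame named `sample` instead of the real `questions` data so that it works without the CSV file.

```python
import pandas as pd

# Tiny stand-in for the real `questions` frame; the values are made up.
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "FavoriteCount": [2.0, None, None],
})

# Share of rows where no favorite count is recorded.
missing_share = sample["FavoriteCount"].isna().mean()

# Assumption: a missing value means zero favorites, so we fill with 0
# and convert the column from float to integer.
sample["FavoriteCount"] = sample["FavoriteCount"].fillna(0).astype(int)

print(missing_share)
print(sample.dtypes)
```

If missing values turn out to mean something else, dropping the column instead of filling it may be the safer choice.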
