In [1]:
%%html
<style>
table {align:left;display:block}  /* align HTML tables to the left */
</style>

# Dataquest - Data Analysis In Business <br/> <br/> Project Title: Popular Data Science Questions

## 1) Introduction

#### Background

Provided by: [Dataquest.io](https://www.dataquest.io/)

In this scenario, you're working for a company that creates data science content, be it books, online articles, videos or interactive text-based platforms like Dataquest.

You're tasked with figuring out what the best content to write about is. Because you took this course, you know that given the lack of instructions there's some leeway in what "best" means here.

Since you're passionate about helping people learn, you decide to scour the internet in search of the answer to the question "What is it that people want to learn about in data science?" (as opposed to determining the most profitable content, for instance).

Thinking back to your experience when you first started learning programming, it occurs to you that if you wanted to figure out what programming content to write, you could consult [Stack Overflow](https://stackoverflow.com/) (a question and answer website about programming) and see what kind of content is more popular.

You decide to investigate Stack Overflow a little more and find out that it is part of a question and answer website network called [Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange).

## 2) Stack Exchange

Provided by: [Dataquest.io](https://www.dataquest.io/)

Stack Exchange hosts sites on a multitude of fields and subjects, including mathematics, physics, philosophy, and [data science](https://datascience.stackexchange.com/)!

Stack Exchange employs a reputation award system for its questions and answers. Each post, whether a question or an answer, is subject to upvotes and downvotes. This ensures that good posts are easily identifiable.

Data Science Stack Exchange (DSSE) is a site dedicated to data science (unlike the others), and it has a lot of unanswered questions. Together, these make it an ideal candidate for this investigation. DSSE will be the focus of this guided project.

#### Findings (Stack Exchange):

Questions on this site are categorised by tags, which may help us review what kinds of data science questions people are most interested in.

## 3) Stack Exchange Data Explorer

Provided by: [Dataquest.io](https://www.dataquest.io/)

Stack Exchange provides a public database for each of its websites. [Here's](https://data.stackexchange.com/datascience/query/new) a link to query and explore Data Science Stack Exchange's database.

More information is available about Stack Exchange Data Explorer (SEDE) on its [help section](https://data.stackexchange.com/help) and on this [tutorial link](https://data.stackexchange.com/tutorial).

We can run SQL queries using SEDE.

Note that SEDE uses a different dialect ([Transact-SQL](https://en.wikipedia.org/wiki/Transact-SQL), Microsoft's SQL) than SQLite. For instance, the query below selects the 10 most used tags.

```
SELECT TOP 10 *
  FROM tags
 ORDER BY Count DESC;
```

In SQLite we would not only use the keyword LIMIT instead of TOP, we would also include it at the end of the query instead of in the SELECT clause. [Here's](https://www.mssqltips.com/sqlservertip/4777/comparing-some-differences-of-sql-server-to-sqlite/) a helpful resource for navigating the differences.
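To make the difference concrete, here is a minimal sketch of the SQLite form of the query, run through Python's built-in `sqlite3` module. The in-memory table and its three rows are made up for illustration; they only stand in for SEDE's `tags` table.

```python
import sqlite3

# Hypothetical in-memory table standing in for SEDE's tags table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (TagName TEXT, Count INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [("python", 30), ("r", 10), ("keras", 20)],
)

# SQLite puts LIMIT at the end of the query, where T-SQL would write SELECT TOP 2.
top_tags = conn.execute(
    "SELECT TagName, Count FROM tags ORDER BY Count DESC LIMIT 2"
).fetchall()
conn.close()

print(top_tags)
```

The same query on SEDE would be written as `SELECT TOP 2 TagName, Count FROM tags ORDER BY Count DESC;`.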

#### Findings (Stack Exchange Data Explorer):

A quick query to get a feel for which tags are most used on DSSE:

```
SELECT TOP 10 *
  FROM tags
 ORDER BY Count DESC;
```

<img src="top_10.png"> </img>

## 4) Getting The Data

Provided by: [Dataquest.io](https://www.dataquest.io/)

The posts table has a lot of columns. We'll be focusing our attention on those that seem relevant towards our goal:

Columns | Description
--- | ---
Id| An identification number for the post.
PostTypeId| An identification number for the type of post.
CreationDate| The date and time of creation of the post.
Score| The post's score.
ViewCount| How many times the post was viewed.
Tags| What tags were used.
AnswerCount| How many answers the question got (only applicable to question posts).
FavoriteCount| How many times the question was [favored](https://meta.stackexchange.com/questions/53585/how-do-question-bookmarks-work) (only applicable to question posts).


**PostTypeId:**

<img src="post_type_id.png"> </img>

There are eight different types of post. Before we try to figure out which of them are relevant to us, let's check how many of them there are:

```
SELECT PostTypeId, COUNT(*) as NrOfPosts
  FROM posts
 GROUP BY PostTypeId
 ORDER BY 2 DESC;
```

<img src="post_type_query.png" style="width: 300px;"> </img>

Due to their low volume, anything that isn't a question or an answer is mostly inconsequential. Even if such posts happen to be immensely popular, they would just be outliers and not relevant to us. We'll therefore focus on the questions.

For this project, we are only interested in recent posts, so we limit the data to posts from 2019.

#### Findings (Getting The Data):

We run a query against the SEDE DSSE database that extracts the columns listed above for all questions and answers created in 2019.

The result of the query below was downloaded as a CSV file, '2019_questions_query.csv', which is stored in this project's GitHub folder.

```
SELECT 
    Id, PostTypeId, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM
    Posts
WHERE
    YEAR(CreationDate) = 2019
    AND
    PostTypeId IN (1, 2);
```

## 5) Exploring The Data:

In [2]:
# Read in the file into a dataframe.
import pandas as pd

df_2019 = pd.read_csv('2019_questions_query.csv', parse_dates=True)
df_2019.head()

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Tags,AnswerCount,FavoriteCount
0,63868,1,2019-11-27 15:46:19,0,17.0,<deep-learning><dataset><image-classification>...,0.0,
1,63869,2,2019-11-27 16:36:39,3,,,,
2,63870,2,2019-11-27 16:52:26,2,,,,
3,63871,2,2019-11-27 16:53:16,0,,,,
4,63872,1,2019-11-27 17:17:53,1,1467.0,<machine-learning><classification><xgboost>,1.0,


In [3]:
# Explore the data.

# review datatypes
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14462 entries, 0 to 14461
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             14462 non-null  int64  
 1   PostTypeId     14462 non-null  int64  
 2   CreationDate   14462 non-null  object 
 3   Score          14462 non-null  int64  
 4   ViewCount      6761 non-null   float64
 5   Tags           6761 non-null   object 
 6   AnswerCount    6761 non-null   float64
 7   FavoriteCount  1693 non-null   float64
dtypes: float64(3), int64(3), object(2)
memory usage: 904.0+ KB


In [4]:
# review null values
print(df_2019.isnull().sum())


Id                   0
PostTypeId           0
CreationDate         0
Score                0
ViewCount         7701
Tags              7701
AnswerCount       7701
FavoriteCount    12769
dtype: int64


#### Findings (Exploring The Data):

There are 14,462 rows and 8 columns.

We may want to modify datatypes to suit our analysis (e.g. convert **CreationDate** to a datetime type).

There are 7,701 null values in 3 columns:
- **ViewCount**
- **Tags**
- **AnswerCount**

There are 12,769 null values in column **'FavoriteCount'**.

Values in the **Tags** column consist of one or more tags, each enclosed in angle brackets, e.g. `<machine-learning><classification><xgboost>`.
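The findings above suggest three preparation steps: parse the dates, separate questions from answers, and unpack the tags. The sketch below shows one possible way to do each on two rows copied from the `head()` output, inlined so that it runs without the CSV file. The names `sample`, `questions` and `TagList` are illustrative, and the reading that the null values belong to answer rows (`PostTypeId` 2) is an assumption based on those sample rows.

```python
import io
import pandas as pd

# Two rows copied from the head() output above, inlined so the sketch runs without the CSV.
sample_csv = io.StringIO(
    "Id,PostTypeId,CreationDate,Score,ViewCount,Tags,AnswerCount,FavoriteCount\n"
    "63869,2,2019-11-27 16:36:39,3,,,,\n"
    "63872,1,2019-11-27 17:17:53,1,1467.0,<machine-learning><classification><xgboost>,1.0,\n"
)
sample = pd.read_csv(sample_csv)

# Parse CreationDate into datetime64 instead of leaving it as object.
sample["CreationDate"] = pd.to_datetime(sample["CreationDate"], format="%Y-%m-%d %H:%M:%S")

# Keep only the question rows (PostTypeId == 1); the answer row has no Tags or ViewCount.
questions = sample[sample["PostTypeId"] == 1].copy()

# Pull each tag out of its angle brackets into a Python list.
questions["TagList"] = questions["Tags"].str.findall(r"<([^>]+)>")

print(questions[["Id", "CreationDate", "TagList"]])
```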

## 6) Cleaning The Data:

The data is fairly clean, and we only need to modify it slightly to better fit our analysis.

Let's work on the findings above.

In [5]:
# Fill in the missing values with 0.


df_2019 = df_2019.fillna(0)

# review transformation 
df_2019.isnull().sum()

Id               0
PostTypeId       0
CreationDate     0
Score            0
ViewCount        0
Tags             0
AnswerCount      0
FavoriteCount    0
dtype: int64

In [6]:
# setup datatypes for each column

# change to datetime object; input format example: '2019-11-27 15:46:19'
# datetime documentation: https://docs.python.org/3/library/datetime.html
import datetime as dt

def parse_datetime(dt_input):
    output = dt.datetime.strptime(dt_input, '%Y-%m-%d %H:%M:%S')
    return output

df_2019['CreationDate'] = df_2019['CreationDate'].apply(parse_datetime)


# change dtype to int for following columns
cols = ['ViewCount', 'AnswerCount', 'FavoriteCount']
df_2019[cols] = df_2019[cols].astype(int)

# review transformation
print(df_2019.info())
df_2019['CreationDate'].head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14462 entries, 0 to 14461
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             14462 non-null  int64         
 1   PostTypeId     14462 non-null  int64         
 2   CreationDate   14462 non-null  datetime64[ns]
 3   Score          14462 non-null  int64         
 4   ViewCount      14462 non-null  int64         
 5   Tags           14462 non-null  object        
 6   AnswerCount    14462 non-null  int64         
 7   FavoriteCount  14462 non-null  int64         
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 904.0+ KB
None


0   2019-11-27 15:46:19
1   2019-11-27 16:36:39
2   2019-11-27 16:52:26
3   2019-11-27 16:53:16
4   2019-11-27 17:17:53
Name: CreationDate, dtype: datetime64[ns]

In [7]:
# replace the tag column content for easier string data manipulation
# i.e. remove each '<' and turn each '>' into a ',' separator

# review values before transformation
print('Before Transformation:\n')
print(df_2019['Tags'].value_counts(), '\n')

# transform (regex=False: '<' and '>' are treated as literal characters)
df_2019['Tags'] = df_2019['Tags'].str.replace('<', '', regex=False).str.replace('>', ',', regex=False)
# rows holding the integer 0 become NaN under .str, so fill this column again
df_2019['Tags'] = df_2019['Tags'].fillna('0')

# review transformation
print('After Transformation:\n')
print(df_2019['Tags'].value_counts(), '\n')
print(df_2019.info(), '\n')
print(df_2019.isnull().sum(), '\n')
df_2019['Tags'].head()

Before Transformation:

0                                                          7701
<machine-learning>                                           86
<python><pandas>                                             52
<python>                                                     41
<nlp>                                                        36
                                                           ... 
<genetic-algorithms>                                          1
<cnn><logistic-regression>                                    1
<deep-learning><image-classification><convolution>            1
<machine-learning><python><clustering><k-means><dbscan>       1
<neural-network><deep-learning><nlp>                          1
Name: Tags, Length: 5073, dtype: int64 

After Transformation:

0                                                          7701
machine-learning,                                            86
python,pandas,                                               52
python,         

0    deep-learning,dataset,image-classification,tra...
1                                                    0
2                                                    0
3                                                    0
4             machine-learning,classification,xgboost,
Name: Tags, dtype: object

In [31]:
# get list of available tags from the column 'tags'


# split tags into multiple columns
tags = df_2019['Tags'].str.split(',', expand=True)

# melt the different columns into single column
tags = pd.melt(frame=tags, value_vars=[0, 1, 2, 3, 4, 5])

# drop unneeded column
tags.drop(labels='variable', axis=1, inplace=True)

# review transformation: tags is frequency table
print(tags.value_counts())

tags == 'r'

value            
0                    7701
                     6761
machine-learning     2129
python               1440
deep-learning         895
                     ... 
community               1
domain-adaptation       1
aws-lambda              1
tesseract               1
julia                   1
Length: 490, dtype: int64


Unnamed: 0,value
0,False
1,False
2,False
3,False
4,False
...,...
86767,False
86768,False
86769,False
86770,False


In [9]:
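# drop duplicate tags; .copy() gives an independent dataframe, which avoids
# SettingWithCopyWarning when columns are added to tag_list later
tag_list = tags.drop_duplicates().copy()

# reset index
tag_list.reset_index(drop=True, inplace=True)

# review transformation: tag_list holds each unique tag once
tag_list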

Unnamed: 0,value
0,deep-learning
1,0
2,machine-learning
3,python
4,gradient-descent
...,...
486,normal-equation
487,f1score
488,catboost
489,mean-shift


## 7) Most Used And Most Viewed

Provided by: [Dataquest.io](https://www.dataquest.io/)

We now focus on determining the most popular tags. We'll do so by considering two different popularity proxies: for each tag we'll count how many times the tag was used, and how many times a question with that tag was viewed.

We could take into account the score, or whether or not a question is part of someone's favorite questions. These are all reasonable options to investigate; but we'll limit the focus of our research to counts and views for now.
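As a rough illustration of the two proxies, the sketch below computes both in one pass by giving every tag of every question its own row and then grouping by tag. It is only a sketch on a made-up three-row dataset, not the approach used in the cells that follow; the names `per_tag` and `popularity` are illustrative.

```python
import pandas as pd

# Hypothetical mini-dataset; the tag strings follow the '<a><b>' format of the Tags column.
questions = pd.DataFrame({
    "Tags": ["<python><pandas>", "<machine-learning><python>", "<r>"],
    "ViewCount": [100, 50, 10],
})

# One row per (question, tag) pair: extract the tags as a list, then explode it.
per_tag = (
    questions.assign(Tag=questions["Tags"].str.findall(r"<([^>]+)>"))
    .explode("Tag")
)

# Proxy 1: how many questions used each tag. Proxy 2: total views of those questions.
popularity = (
    per_tag.groupby("Tag")["ViewCount"]
    .agg(tag_count="count", tag_views="sum")
    .sort_values("tag_views", ascending=False)
)

print(popularity)
```

Because each tag becomes its own row, a tag is only ever compared as a whole string, so `r` is not confused with tags that merely contain the letter "r".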

In [10]:



# review tags: frequency table from previous transformation
# the top 2 values ('0' and the empty string) stand for missing tags, so skip them
print('Count how many times each tag was used:')
print(tags.value_counts()[2:])

Count how many times each tag was used:
value            
machine-learning     2129
python               1440
deep-learning         895
neural-network        819
keras                 707
                     ... 
community               1
domain-adaptation       1
aws-lambda              1
tesseract               1
julia                   1
Length: 488, dtype: int64


In [11]:
# Count how many times each tag was viewed, using the unique tags in tag_list.
# For each tag, select the rows of df_2019 whose Tags string contains it
# and sum their ViewCount.
# NOTE: str.contains does substring matching, so a short tag such as 'r'
# also matches 'machine-learning'; totals for short tags are inflated.

import re

def extract_tag_view_count(tag):
    # boolean mask of the rows whose Tags string contains the tag
    mask = df_2019['Tags'].str.contains(re.escape(str(tag)))
    return df_2019.loc[mask, 'ViewCount'].sum()

tag_list['tag_views'] = tag_list['value'].apply(extract_tag_view_count)
tag_list.sort_values(by='tag_views', ascending=False, inplace=True)



In [12]:
# drop the placeholder rows that represent no tag ('0', empty string, missing value)
no_tag = tag_list['value'].isin(['0', '']) | tag_list['value'].isna()
tag_list = tag_list[~no_tag].copy()
tag_list.reset_index(drop=True, inplace=True)
tag_list['tag_views'] = tag_list['tag_views'].astype(int)


# review transformation
print(tag_list.dtypes)
tag_list

value        object
tag_views     int64
dtype: object



Unnamed: 0,value,tag_views
0,r,6400648
1,c,5240696
2,python,2788505
3,learning,2620702
4,machine-learning,1875729
...,...,...
483,usecase,28
484,caffe,24
485,relational-dbms,18
486,doc2vec,17
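In the table above, very short tags such as `r` and `c` rank first, which is what we would expect if the lookup matches any tag string that merely contains those letters. The sketch below contrasts plain substring matching with a pattern that only matches a whole comma-delimited tag. The helper `exact_tag_pattern` and the three sample strings are hypothetical, and the pattern assumes the comma-terminated format produced by the cleaning step.

```python
import re
import pandas as pd

# Tag strings in the comma-terminated format produced by the cleaning step above.
tag_strings = pd.Series(["machine-learning,python,", "r,statistics,", "keras,"])

def exact_tag_pattern(tag):
    # Hypothetical helper: match `tag` only when it is a whole comma-delimited entry.
    return r"(?:^|,)" + re.escape(tag) + r","

# Substring matching: every string contains the letter 'r' somewhere.
substring_hits = tag_strings.str.contains(re.escape("r"))

# Exact matching: only the row actually tagged 'r' is selected.
exact_hits = tag_strings.str.contains(exact_tag_pattern("r"))

print(substring_hits.tolist())
print(exact_hits.tolist())
```

With substring matching all three rows count towards `r`; with the anchored pattern only the second one does.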


In [30]:
# Count how many times each tag was used.
# combine tag count and tag views in the same dataframe for ease of data visualisation
# NOTE: this uses the same substring matching as the view count,
# so counts for short tags are inflated

def extract_tag_post_count(tag):
    # number of rows of df_2019 whose Tags string contains the tag
    return df_2019['Tags'].str.contains(re.escape(str(tag))).sum()

tag_list['tag_count'] = tag_list['value'].apply(extract_tag_post_count)
tag_list.sort_values(by='tag_count', ascending=False, inplace=True)

# review transformation
print(tag_list.dtypes)
tag_list


value        object
tag_views     int64
tag_count     int64
dtype: object



Unnamed: 0,value,tag_views,tag_count
0,r,6400648,5943
1,c,5240696,5190
3,learning,2620702,2972
4,machine-learning,1875729,2196
2,python,2788505,1455
...,...,...,...
448,lda-classifier,168,1
412,hashingvectorizer,495,1
415,lime,468,1
301,tesseract,2910,1


In [27]:
# Create visualizations for the top tags of the above results.