## Data collection

*Document your data collection process and the properties of the data here. Implement, using Python code, to load and preprocess your selected dataset.*

To acquire a large number of Stack Overflow questions about C#, you can run SQL queries against the Stack Exchange Data Explorer at https://data.stackexchange.com/stackoverflow/queries. Each query returns at most 50 000 results, so several queries are needed to retrieve all questions from a given period. The period chosen in this report is 2019-09-22 to 2020-11-08, during which C# 8 was the current release of C#. The assumption is that documentation of the newest features is the most likely to be lacking.

The query used is as follows:
```
SELECT * FROM posts WHERE Tags LIKE '%c#%' AND posts.CreationDate > 'Insert start date here' AND posts.CreationDate < 'Insert end date here'
ORDER BY posts.CreationDate desc
```
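Because the date bounds change from one run to the next, the query text can be generated rather than edited by hand. The following is a minimal sketch, not part of the original workflow: `build_query` is a hypothetical helper, and it assumes the start of the window is the earlier date and the end the later one.

```python
from datetime import datetime

# Same query as above, with placeholders for the two date bounds
QUERY_TEMPLATE = (
    "SELECT * FROM posts WHERE Tags LIKE '%c#%' "
    "AND posts.CreationDate > '{start}' AND posts.CreationDate < '{end}'\n"
    "ORDER BY posts.CreationDate desc"
)


def build_query(start, end):
    """Return the query text for one date window (hypothetical helper)."""
    if start >= end:
        raise ValueError('start must be earlier than end')
    fmt = '%Y-%m-%d %H:%M:%S'
    return QUERY_TEMPLATE.format(start=start.strftime(fmt), end=end.strftime(fmt))


print(build_query(datetime(2019, 9, 22), datetime(2020, 11, 8)))
```

Since the results are ordered newest first, one option is to use the oldest `CreationDate` of a capped result as the end bound of the next query. Posts that share exactly that timestamp would be skipped by the strict comparison, so the bounds may need a small overlap.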

Because the export requires several queries, the date bounds differ between runs, and the result is a set of CSV files. In our case, we ended up with three files containing a total of 117 527 questions. These files were then combined into a single file, Data.csv, which is used in the following steps.
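The combination step is not shown in the notebook. A minimal sketch of how it could be done with pandas follows. The helper name `combine_exports` and the input file names in the usage note are assumptions, and dropping duplicates on `Id` is an added safeguard for overlapping date windows rather than something the original describes.

```python
import pandas as pd


def combine_exports(paths, output_path):
    """Concatenate several CSV exports and write them to a single file."""
    frames = [pd.read_csv(path, encoding='utf-8') for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    # Overlapping date windows can export the same post twice
    combined = combined.drop_duplicates(subset='Id')
    combined.to_csv(output_path, index=False, encoding='utf-8')
    return combined
```

A call such as `combine_exports(['part1.csv', 'part2.csv', 'part3.csv'], 'Data.csv')` would then produce the file read below, where the three input names are placeholders.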

### Read the data from the file and print some values to verify that it was read correctly

In [152]:
# Importing the necessary libraries
import pandas as pd
from collections import Counter
import textmining as tm
import operator
import nltk
import matplotlib
from sklearn.feature_extraction.text import CountVectorizer
# Silence the pandas chained-assignment warning (SettingWithCopyWarning), which is safe to ignore in this application
pd.options.mode.chained_assignment = None  # default='warn'

In [108]:
# Read our data from the Data.csv file
stackOverflowData = pd.read_csv(
    './Data.csv',
    encoding='utf-8'
)

In [109]:
# Print the size of the dataset read
print('Number of Rows and Columns:')
print(stackOverflowData.shape)

Number of Rows and Columns:
(117527, 23)


In [110]:
# Print the column names of the data
listOfColumnNames = stackOverflowData.columns.values.tolist()
print('Column names:')
for name in listOfColumnNames:
    print(name)

Column names:
Id
PostTypeId
AcceptedAnswerId
ParentId
CreationDate
DeletionDate
Score
ViewCount
Body
OwnerUserId
OwnerDisplayName
LastEditorUserId
LastEditorDisplayName
LastEditDate
LastActivityDate
Title
Tags
AnswerCount
CommentCount
FavoriteCount
ClosedDate
CommunityOwnedDate
ContentLicense


In [111]:
# Print the date range covered by the data
minValue = stackOverflowData['CreationDate'].min()
maxValue = stackOverflowData['CreationDate'].max()
print('Dates from which the data is produced: ' +
      minValue + ' to ' + maxValue)

Dates from which the data is produced: 2019-09-22 00:02:14 to 2020-11-08 04:47:27


In [112]:
# Keep only questions that have not been closed and that have a non-negative score
stackOverflowData = stackOverflowData[stackOverflowData['ClosedDate'].isnull()]
stackOverflowData = stackOverflowData[stackOverflowData['Score'] >= 0]
stackOverflowData.shape

(98705, 23)

### The next step is to select the relevant columns into a new DataFrame and prepare the data for the analysis phase

In [113]:
# Create a DataFrame with only the necessary columns
# (.copy() makes df independent of stackOverflowData)
df = stackOverflowData[['AcceptedAnswerId', 'Title',
                        'CreationDate', 'Body']].copy()

In [114]:
# This is not implemented yet, but here we will remove all rows which have an accepted answer.
# Missing AcceptedAnswerId values are read as NaN, not as '', so the filter has to test for null.
#df = df[df['AcceptedAnswerId'].isnull()]
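The planned filter can be tried out on a small stand-in before it is applied to the real data. The sketch below uses a hypothetical `toy` frame and assumes, as is the default for `pd.read_csv`, that an empty `AcceptedAnswerId` cell is loaded as NaN.

```python
import pandas as pd

# Toy frame standing in for df; None becomes NaN in a numeric column
toy = pd.DataFrame({
    'AcceptedAnswerId': [101.0, None, 205.0, None],
    'Title': ['a', 'b', 'c', 'd'],
})

# Comparing with '' keeps every row, because neither NaN nor a number equals ''
kept_by_string_compare = toy.loc[toy['AcceptedAnswerId'] != '']

# isnull() selects exactly the questions without an accepted answer
unanswered = toy[toy['AcceptedAnswerId'].isnull()]

print(len(kept_by_string_compare), len(unanswered))
```

Applying the `isnull()` filter to `df` would change the row counts reported in the later cells, which were produced without it.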

In [115]:
# Convert the post bodies to lowercase
df['Body'] = df['Body'].str.lower()

In [116]:
# List of words that we want to remove from the dataset
# TODO: HTML tags such as <p> that are attached to a word without a space are not removed yet
stopWords = ['code', 'gt', 'i', '<p>i']

# Remove all the stop words from the data
df['Body'] = df['Body'].apply(lambda x: ' '.join(
    [word for word in x.split() if word not in stopWords]))

# Print a few rows to check the stop-word removal
print('Test the stop-word removal: \n')
print(df['Body'].head(4) + '\n')
df['Body'].shape

Test the stop-word removal: 

0    use load / unload assembly to get all types th...
1    cannot figure out what isnt being set for the ...
2    need the following data</p> <p><a href="https:...
3    <pre><code>using system; using system.collecti...
Name: Body, dtype: object


(98705,)
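The output above still contains markup such as `</p>` and `<pre><code>`, which whitespace-based stop-word removal cannot catch. One possible approach, sketched here with the standard library only, is to strip tags with a regular expression and decode HTML entities before splitting into words. `strip_html` is a hypothetical helper, and a regex is only a rough substitute for a real HTML parser.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </code>, <a href="...">
TAG_PATTERN = re.compile(r'<[^>]+>')


def strip_html(text):
    """Remove HTML tags and decode entities such as &gt; (hypothetical helper)."""
    without_tags = TAG_PATTERN.sub(' ', text)  # '<p>i' becomes ' i'
    decoded = html.unescape(without_tags)      # '&gt;' becomes '>'
    return ' '.join(decoded.split())           # collapse repeated whitespace


print(strip_html('<p>i need the following data</p>'))
```

Tags are removed before entities are decoded so that escaped code such as `list&lt;int&gt;` is not mistaken for a tag. Under these assumptions the helper could be applied with `df['Body'].apply(strip_html)` ahead of the stop-word step.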

In [194]:
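# Read the whitelist of topics (column 'C# documentation') and lowercase it
whiteList = pd.read_csv(
    './csharp-topics.csv',
    encoding='utf-8')
whiteList['C# documentation'] = whiteList['C# documentation'].str.lower()

# For each topic, count the posts whose body contains the topic as a substring
mydic = {}
for topic in whiteList['C# documentation']:
    mydic[topic] = df['Body'].str.contains(topic, regex=False).sum()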

In [198]:
mydic

{'get started': 70,
 'introduction to the c# language and .net': 0,
 'tour of c#': 0,
 'introduction': 81,
 'types': 4902,
 'program building blocks': 0,
 'major language areas': 0,
 'tutorials': 665,
 'overview': 329,
 'introduction to programming with c#': 0,
 'choose your first lesson': 0,
 'hello world': 568,
 'numbers in c#': 2,
 'branches and loops': 0,
 'list collections': 2,
 'work in you local environment': 0,
 'set up your environment': 0,
 'introduction to classes': 0,
 'object-oriented programming': 2,
 'explore record types': 0,
 'explore top level statements': 0,
 'explore patterns in objects': 0,
 'explore c# 6': 0,
 'explore string interpolation - interactive': 0,
 'explore string interpolation - in your environment': 0,
 'advanced scenarios for string interpolation': 0,
 'safely update interfaces with default interface methods': 0,
 'create mixin functionality with default interface methods': 0,
 'explore indexes and ranges': 0,
 'work with nullable reference types': 0
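These counts come from plain substring matching, so a short topic such as 'types' is also counted when it appears inside a longer word such as 'prototypes'. A minimal sketch of a stricter, whole-word variant is shown below. `count_posts_mentioning` is a hypothetical helper that works on any iterable of strings, and it assumes every body is a string (no missing values).

```python
import re


def count_posts_mentioning(bodies, topic):
    """Count the posts in which `topic` occurs as a whole word or phrase."""
    # (?<!\w) and (?!\w) stop a match from starting or ending inside a longer word
    pattern = re.compile(r'(?<!\w)' + re.escape(topic) + r'(?!\w)')
    return sum(1 for body in bodies if pattern.search(body))


sample = ['value types and reference types',
          'javascript prototypes',
          'explore record types']
print(count_posts_mentioning(sample, 'types'))
```

`re.escape` is used so that topics containing characters with a special meaning in regular expressions, such as 'c#' or '.net', are matched literally.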

In [207]:
# Turn the topic counts into a DataFrame with one row per topic
dataframe = pd.DataFrame.from_dict(mydic, orient='index', columns=['Count'])
dataframe

| | Count |
|---|---|
| in | 98293.0 |
| on | 96947.0 |
| is | 95290.0 |
| `<code>` | 85367.0 |
| do | 75352.0 |
| ... | ... |
| `<summary>` | 0.0 |
| `<seealso>` | 0.0 |
| `<see>` | 0.0 |
| `<returns>` | 0.0 |


## Data analysis

*Document your choice and motivation for selected data mining method(s) here. Choose a data mining method(s) to use in Python code to perform an analysis of your chosen dataset. Describe why you chose the method(s) and what interesting things you have found from the analysis.*

*Replace the contents of this cell with your own text.*

In [None]:
# Create a list of the most frequent words. The nrOfWords variable defines how many words the list should contain.
nrOfWords = 10
rslt = Counter(' '.join(df['Body']).split()).most_common(nrOfWords)

# Print the list created above as "word = count".
print('\n')
for word, count in rslt:
    print('{} = {}'.format(word, count))
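To make the shape of the result concrete: `Counter.most_common` returns `(word, count)` pairs, most frequent first. The sketch below runs the same idea on a tiny made-up sample. `top_words` and `sample` are hypothetical names, and the counter is updated one post at a time instead of joining all bodies into one large string.

```python
from collections import Counter


def top_words(bodies, n):
    """Return the n most frequent whitespace-separated words as (word, count) pairs."""
    counts = Counter()
    for body in bodies:
        counts.update(body.split())  # add the words of one post at a time
    return counts.most_common(n)


sample = ['async await task', 'task list', 'task await']
for word, count in top_words(sample, 2):
    print('{} = {}'.format(word, count))
# prints:
# task = 3
# await = 2
```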

## Evaluation of results

*Document an evaluation of your analysis results and describe how potentially actionable they are.*

*Replace the contents of this cell with your own text.*

In [None]:
# Add your own code

## Schedule and description of project plan

*Rough schedule for the project beyond the pilot study presented in 3-5. This does not have to be advanced, you can simply provide an estimate based upon reported schedules for similar projects in the literature.*

*Replace the contents of this cell with your own text.*

## Ethical aspects that need to be considered

*Are there ethical aspects that need to be considered? Are there legal implications (e.g., personal data / GDPR)? Are there implications if the case organization is a business, public authority, or nonprofit entity?*

*Replace the contents of this cell with your own text.*