## COSC2671 Social Media and Network Analytics Workshop Week 2

Jeffrey Chan, RMIT University 2022

Import the required packages and classes.  The first time you run this cell, nltk may download a list of stopwords and print output along the lines of "Downloading package stopwords to ...".

In [1]:
import json
from argparse import ArgumentParser
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import cross_validate
from sklearn.decomposition import TruncatedSVD
import numpy as np
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Parameters for our notebook.  Change these if your input file has a different name or if you want different values for max_df and min_df.

In [2]:
inputPostFile='filteredPosts.json'
max_df=1.0
min_df=3
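
To make the two thresholds concrete: in scikit-learn's `TfidfVectorizer`, an integer value is read as an absolute document count and a float as a proportion of documents. So `min_df=3` drops words that occur in fewer than 3 posts, and `max_df=1.0` applies no upper filter. A minimal sketch on a made-up corpus (the names `toy_posts`, `vec_min` and `vec_max` are hypothetical and not part of the workshop code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from filteredPosts.json.
toy_posts = [
    "python pandas dataframe",
    "python numpy array",
    "python pandas merge",
    "statistics regression model",
]

# min_df=2 (an integer) keeps only terms that occur in at least 2 documents.
vec_min = TfidfVectorizer(min_df=2)
vec_min.fit(toy_posts)
print(sorted(vec_min.vocabulary_))

# max_df=0.5 (a float) drops terms that occur in more than 50% of the documents.
vec_max = TfidfVectorizer(max_df=0.5)
vec_max.fit(toy_posts)
print(sorted(vec_max.vocabulary_))
```

With `min_df=2` only "pandas" and "python" survive. With `max_df=0.5`, "python" is removed because it appears in three of the four documents.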

Cell In [3] opens the posts, parses the title and body of each question together with its tags, applies a vectorizer to turn the text into vector form, then trains and evaluates a classifier (the worksheet starts with an SVM; the cell as shown here uses KNeighborsClassifier).  Please examine the code carefully: there are comments throughout to help you understand it.  Also consult the worksheet, and ask your friendly demonstrator if you are unsure about something.  Run the code and examine the output, then refer to the worksheet to see what you need to change.
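
Before the full cell, the parsing and label-encoding steps can be sketched on two made-up records. Judging from the parsing code in In [3], the input file appears to hold one JSON object per line with `Body`, `Tags` and an optional `Title` field. The records in `sample_lines` are hypothetical and only mirror those field names.

```python
import json
from sklearn.preprocessing import MultiLabelBinarizer

# Two hypothetical input lines; the field names mirror those used in this notebook,
# but the content is invented for illustration.
sample_lines = [
    '{"Title": "Intro books? ", "Body": "Looking for open textbooks.", "Tags": "education books"}',
    '{"Body": "Use cross-validation.", "Tags": "education"}',
]

posts, labels = [], []
for line in sample_lines:
    doc = json.loads(line)
    # doc.get returns '' when a record has no Title, so only the body is used
    posts.append(doc.get("Title", "") + doc["Body"])
    # the tags arrive as one space-separated string
    labels.append(doc["Tags"].split(" "))

# One column per distinct tag, 1 where the post carries that tag
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
print(mlb.classes_)
print(y)
```

Each row of `y` corresponds to one post and each column to one tag, with columns following the sorted order in `mlb.classes_`.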

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# list of stop words we will be using (from nltk)
stop_list = stopwords.words('english')

# list of posts
lPosts = []
# list of labels, note the indices of this and lPosts should match.
llLabels = []

# open the input file and parse the posts and tags
with open(inputPostFile, 'r') as f:
    for sLine in f:
        doc = json.loads(sLine)

        # if the post has a title, combine the title and body text together
        if 'Title' in doc:
            post = doc['Title'] + doc['Body']
        else:
            # otherwise just use the body
            post = doc['Body']

        lPosts.append(post)

        # split the string of tags into a list of tags then add to llLabels
        llLabels.append(doc['Tags'].split(' '))

print("llLabels: ", len(llLabels))
print("lPosts: ", len(lPosts))


# will do tokenisation, convert to lower case, remove stop words etc., and reweight using the TF-IDF scheme,
# then put all of this into a document-word matrix.
# Look at the documentation for more things you can do with this class.
vectorizer = TfidfVectorizer(min_df=min_df,
                             stop_words=stop_list,
                             max_df=max_df,
                             lowercase=True)


# Actually build the document-word matrix (X)
X = vectorizer.fit_transform(lPosts)
print("Shape of document-word matrix X (posts, words): ", X.shape)

# optional dimensionality reduction with truncated SVD (currently disabled)
# pca = TruncatedSVD(n_components=1000)
# pca.fit(X)
# newX = pca.transform(X)
# print("Shape of newX: ", newX.shape)
# X=newX


# scikit-learn has another great class, MultiLabelBinarizer, which constructs binary vectors from multi-labelled data:
# for each class there is a corresponding entry in the vector, with a value of 1 if the label is present
# in a document/post, otherwise 0
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(llLabels)
print("Shape of y: ", y.shape)

# inspect one example: its text, its tags, its TF-IDF row and its binary label vector
print(lPosts[1], lPosts[1])
print(llLabels[1])
print(X[1])
print(y[1])

# here we can play with a number of different classifiers
# The worksheet starts with an SVM with a linear kernel, i.e. OneVsRestClassifier(LinearSVC());
# this run uses k-nearest neighbours instead
classifier = OneVsRestClassifier(KNeighborsClassifier())


# 10-fold cross-validation, scored with micro-averaged precision, recall and F1
lScoringMetric = ['precision_micro', 'recall_micro', 'f1_micro']
lScores = cross_validate(classifier, X, y=y,
                         cv=10,
                         scoring=lScoringMetric,
                         return_train_score=False)


# finally output the average precision, recall and F1 score across the folds
print("Average precision: " + str(np.mean(lScores['test_' + lScoringMetric[0]])))
print("Average recall: " + str(np.mean(lScores['test_' + lScoringMetric[1]])))
print("Average F1: " + str(np.mean(lScores['test_' + lScoringMetric[2]])))


llLabels:  7033
lPosts:  7033
Shape of document-word matrix X (posts, words):  (7033, 10234)
Shape of y:  (7033, 169)
What open-source books (or other materials) provide a relatively thorough overview of data science?As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers. What open-source books (or other materials) provide a relatively thorough overview of data science?As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.
['education']
  (0


KeyboardInterrupt
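
The recorded output stops at a KeyboardInterrupt, so no scores appear. As a quicker way to see what the scoring step produces, the sketch below runs the same `cross_validate` call with a one-vs-rest linear SVM on a small synthetic multi-label dataset. It is an illustration only: the data comes from scikit-learn's `make_multilabel_classification` rather than from the posts, and the names `X_toy`, `y_toy`, `clf`, `metrics` and `scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the (X, y) pair built in the notebook:
# 200 samples, 20 features, 5 possible labels per sample.
X_toy, y_toy = make_multilabel_classification(n_samples=200,
                                              n_features=20,
                                              n_classes=5,
                                              random_state=0)

# One binary linear SVM per label
clf = OneVsRestClassifier(LinearSVC(random_state=0, max_iter=5000))

# Micro-averaging pools the true/false positives and negatives over all labels
metrics = ['precision_micro', 'recall_micro', 'f1_micro']
scores = cross_validate(clf, X_toy, y=y_toy, cv=5, scoring=metrics)

# cross_validate stores one score per fold under the key 'test_<metric name>'
for name in metrics:
    print(name, np.mean(scores['test_' + name]))
```

The numbers from this synthetic data say nothing about the posts; the point is the shape of the result, one array of per-fold scores for each metric, which is then averaged as in In [3].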

