<h1>Task I Text Classification</h1>
<h3>The target of the text classification task</h3>

<pre>The most representative task is to assign text data, e.g., documents, comments, and literature, to predefined categories.</pre>

<img src='https://monkeylearn.com/static/img/text-classification/Text-classification-model@2x.png'>

<h3>1. Preparing environments</h3>

<pre>Before importing, install these libraries with the pip command.</pre>
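
<pre>A minimal sketch for checking the environment before running the imports. The REQUIRED mapping and the missing_packages helper are hypothetical names introduced here; the mapping covers only the third-party libraries imported below. Note that the pip package for the sklearn import is named scikit-learn.</pre>

```python
import importlib.util

# Import name -> pip package name (hypothetical helper data for this sketch).
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for import_name, pip_name in required.items()
            if importlib.util.find_spec(import_name) is None]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```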

In [1]:
import numpy as np
import pandas as pd
import json
import requests
import pickle
import sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

<h3>2. Clean data (Optional)</h3>

<pre>You should filter out special cases that are useless or even harmful to the training process.</pre>

In [2]:
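def clean_line(t):
    # replace line breaks and tabs with spaces, then trim both ends
    return (t.replace('\n', ' ')
            .replace('\r', ' ')
            .replace('\t', ' ')
            # single pass: shortens runs of spaces but may leave doubles
            .replace('  ', ' ')
            .strip())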
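
<pre>A small sketch of what clean_line does to a sample string. The function is repeated so the example runs on its own. clean_line_strict is a hypothetical variant, not part of the original notebook, that collapses every run of whitespace into a single space.</pre>

```python
def clean_line(t):
    # Same logic as the notebook cell, repeated so this sketch is self-contained.
    return (t.replace('\n', ' ')
            .replace('\r', ' ')
            .replace('\t', ' ')
            .replace('  ', ' ')
            .strip())


def clean_line_strict(t):
    # Hypothetical variant: split on any whitespace, rejoin with single spaces.
    return ' '.join(t.split())


sample = "Hello,\n\tworld!\r\n\r\nBye."
cleaned = clean_line(sample)
strict = clean_line_strict(sample)

print(repr(cleaned))  # 'Hello, world!  Bye.'  (a double space survives)
print(repr(strict))   # 'Hello, world! Bye.'
```

<pre>Switching to the strict variant would change text lengths slightly, so the length filter and the reported accuracies could shift.</pre>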

<h3>3. Load Data</h3>

<pre>In this step, you should transform text data to vectors that can be processed by machine learning models.</pre>
<img src='https://monkeylearn.com/static/img/text-classification/text_process_training.png'>
<pre>One classical approach is TF-IDF. The details of this method can be found at <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html">TF-IDF</a>. Even if you are not familiar with this method, you can easily use the API provided by scikit-learn.</pre>
<pre>Please complete the code below.</pre>

In [3]:
def load_and_process_data():
    # only keep texts longer than 50 characters
    min_len = 50

    all_data = fetch_20newsgroups(
        subset='all', remove=('headers', 'footers', 'quotes'))

    # clean data
    all_text = [clean_line(t) for t in all_data.data]
    all_data_df = pd.DataFrame({'text': all_text, 'topics': all_data.target})

    cleaned_df = all_data_df[all_data_df.text.str.len() > min_len]

    X_raw = cleaned_df['text'].values
    y_raw = cleaned_df['topics'].values

    # split the data with test_size=0.20 and random_state=42
    # your code goes here
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        X_raw, y_raw, test_size=0.20, random_state=42)
    # end

    # transform the data to vectors using the TF-IDF approach
    # your code goes here
    tfidf = TfidfVectorizer()
    X_train_tfidf = tfidf.fit_transform(X_train_raw)
    X_test_tfidf = tfidf.transform(X_test_raw)
    # end

    return X_train_tfidf, X_test_tfidf, y_train, y_test
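
<pre>The TF-IDF step above can be illustrated on a toy corpus. This is a minimal sketch with made-up documents: the vectorizer learns its vocabulary and IDF weights from the training texts only, and the test texts are then mapped into the same vector space.</pre>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
test_docs = ["the dog sat on the mat"]

tfidf = TfidfVectorizer()
X_train_toy = tfidf.fit_transform(train_docs)  # learn vocabulary and IDF weights
X_test_toy = tfidf.transform(test_docs)        # reuse them; unseen words are dropped

vocab = sorted(tfidf.vocabulary_)
print(vocab)              # ['barked', 'cat', 'dog', 'purred', 'sat', 'the']
print(X_train_toy.shape)  # (3, 6)
print(X_test_toy.shape)   # (1, 6)

# A word that occurs in every training document gets the lowest IDF weight.
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]
idf_cat = tfidf.idf_[tfidf.vocabulary_["cat"]]
print(idf_the < idf_cat)  # True
```

<pre>Calling fit_transform on the training split and only transform on the test split keeps information from the test set out of the learned weights.</pre>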

In [4]:
X_train, X_test, y_train, y_test = load_and_process_data()
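
<pre>A short sketch of how the split inside load_and_process_data behaves, using ten made-up samples. With test_size=0.20 one fifth of the samples go to the test set, and fixing random_state makes the split repeatable.</pre>

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples with alternating labels.
X_toy = [f"doc {i}" for i in range(10)]
y_toy = [i % 2 for i in range(10)]

split_a = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 8 2
print(split_a == split_b)    # True: same seed, same split
```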

<h3>4. Training and Prediction</h3>
<img src="https://monkeylearn.com/static/img/text-classification/text_process_prediction.png">
<pre>In this step, you are asked to use three traditional machine learning methods (Support Vector Machines, Random Forest, and K-Nearest Neighbors) to complete this text classification task.</pre>

<pre>a. SVM</pre>
<pre>Please finish the code below</pre>

In [7]:
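def trainSVM(X_train, X_test, y_train, y_test, cpu=1):
    # SGDClassifier's default hinge loss gives a linear SVM trained with SGD;
    # n_jobs only sets how many CPUs are used, not the model itself
    sgd_clf = SGDClassifier(n_jobs=cpu)
    sgd_clf.fit(X_train, y_train)
    predicted = sgd_clf.predict(X_test)
    # accuracy: fraction of test samples predicted correctly
    mean = np.mean(predicted == y_test)
    print("The Result of SVM is: " + str(mean))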
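
<pre>trainSVM relies on SGDClassifier, whose default loss is the hinge loss, so it fits a linear SVM by stochastic gradient descent. The sketch below spells the loss out explicitly on a tiny made-up corpus. The documents, labels, and random_state=0 are assumptions for illustration, not part of the assignment.</pre>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up two-class corpus: 0 = sports, 1 = computing.
docs = [
    "the team won the football match",
    "a great goal in the final game",
    "install the new graphics driver",
    "the program crashed after the update",
]
labels = [0, 0, 1, 1]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)

# loss="hinge" is the default; it is written out here to make the SVM explicit.
clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X_toy, labels)

pred = clf.predict(vec.transform(["the driver update failed"]))
print(SGDClassifier().loss)  # hinge
print(clf.coef_.shape)       # one weight per vocabulary term
print(pred)
```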

<pre>b. Random Forest</pre>
<pre>Please finish the code below</pre>

In [8]:
def trainRF(X_train, X_test, y_train, y_test, trees=100, cpu=1):
    # trees sets the number of decision trees in the forest
    rf_clf = RandomForestClassifier(n_estimators=trees, n_jobs=cpu)
    rf_clf.fit(X_train, y_train)
    predicted = rf_clf.predict(X_test)
    # accuracy: fraction of test samples predicted correctly
    mean = np.mean(predicted == y_test)
    print("The Result of Random Forest is: " + str(mean))

<pre>c. K-Nearest Neighbors</pre>
<pre>Please finish the code below</pre>

In [9]:
def trainKNN(X_train, X_test, y_train, y_test, cpu=1, n_neighbors=5):
    # n_neighbors sets how many nearest training samples vote on each label
    kn_clf = KNeighborsClassifier(n_jobs=cpu, n_neighbors=n_neighbors)
    kn_clf.fit(X_train, y_train)
    predicted = kn_clf.predict(X_test)
    # accuracy: fraction of test samples predicted correctly
    mean = np.mean(predicted == y_test)
    print("The Result of K-Nearest Neighbors is: " + str(mean))
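
<pre>All three training functions score the model with np.mean(predicted == y_test), which is the accuracy. As a quick check on made-up labels, the sketch below shows that this matches accuracy_score from sklearn.metrics.</pre>

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: three of the four predictions are correct.
y_true_toy = np.array([0, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 2])

manual = np.mean(y_pred_toy == y_true_toy)
library = accuracy_score(y_true_toy, y_pred_toy)

print(manual, library)  # 0.75 0.75
```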

<h3>5. Conclusion</h3>
<pre>Please rank these three models by their accuracy on test data.</pre>

In [10]:
def main():
    X_train, X_test, y_train, y_test = load_and_process_data()
    print("Before Tuning the Parameters: ")
    trainSVM(X_train, X_test, y_train, y_test)
    trainRF(X_train, X_test, y_train, y_test)
    trainKNN(X_train, X_test, y_train, y_test)
    print("SVM > Random Forest > KNN")
    print("After Tuning the Parameters: ")
    trainSVM(X_train, X_test, y_train, y_test, -1)
    trainRF(X_train, X_test, y_train, y_test, 1000, -1)
    trainKNN(X_train, X_test, y_train, y_test, -1, 20)
    print("After tuning the parameters, Random Forest's accuracy increases by about 3 percentage points and KNN's by about 9 percentage points.\n")
    print("SVM > Random Forest > KNN")


main()

Before Tuning the Parameters:
The Result of SVM is: 0.7857941834451901
The Result of Random Forest is: 0.6700223713646533
The Result of K-Nearest Neighbors is: 0.5369127516778524
SVM > Random Forest > KNN
After Tuning the Parameters:
The Result of SVM is: 0.7869127516778524
The Result of Random Forest is: 0.7038590604026845
The Result of K-Nearest Neighbors is: 0.6219239373601789
After tuning the parameters, Random Forest's accuracy increases by about 3 percentage points and KNN's by about 9 percentage points.

SVM > Random Forest > KNN


<pre>
Ranking by accuracy on the test data, with all three models trained on the same data:
SVM > Random Forest > KNN
</pre>
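
<pre>In the runs above, the only SVM setting that changes is n_jobs, which controls how many CPUs are used. The small difference between the two SVM scores is therefore likely run-to-run randomness rather than an effect of tuning. The sketch below, on synthetic data generated with make_classification (an assumption for illustration), shows that fixing random_state makes SGDClassifier repeatable. The fit_coef helper is a hypothetical name.</pre>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic three-class data, for illustration only.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_classes=3, random_state=0)


def fit_coef(seed):
    """Fit a linear SVM with the given seed and return its weight matrix."""
    clf = SGDClassifier(random_state=seed)
    clf.fit(X_syn, y_syn)
    return clf.coef_


# Two fits with the same seed produce identical weights.
same_seed_equal = np.array_equal(fit_coef(0), fit_coef(0))
print(same_seed_equal)  # True
```

<pre>Passing a fixed random_state to SGDClassifier and RandomForestClassifier in the training functions would make the before and after comparison easier to interpret.</pre>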