## Final Data Used for the model

We had to decrease the size of the training data by 90% due to deployment issues.
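The notebook does not show how the reduction was done. As a minimal sketch, assuming a simple random sample was acceptable, a 90% reduction could look like the following. The `full_df` frame and the `random_state` value are hypothetical stand-ins, not the project's actual data or settings.

```python
import pandas as pd

# Hypothetical stand-in for the full training set (the real file is not shown here).
full_df = pd.DataFrame({"text": [f"post {i}" for i in range(1000)]})

# Keep a random 10% of the rows; random_state makes the sample reproducible.
limited_df = full_df.sample(frac=0.1, random_state=42).reset_index(drop=True)

print(full_df.shape, "->", limited_df.shape)
```

Because row positions are later used as lookup keys into the database, any ids of that kind would need to be assigned after sampling rather than before.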

In [1]:
import pandas as pd

df = pd.read_csv('limited_trianing2.csv')
df.shape

(3611, 2)

In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,text
0,0,"AskReddit ha hit 25,000,000 subscribers! (inse..."
1,1,Epstein jail guard had been offered plea dealT...
2,2,Subreddit Of The Month [November 2019]: /r/sud...
3,3,AmItheButtface: Where do all the other post go...
4,4,Free Talk Friday!A note about suspected advert...


# Fit the Model

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = TfidfVectorizer(stop_words='english', min_df=0.01, max_df=.99, ngram_range=(1,2))
cv.fit(df.text.values)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.99, max_features=None, min_df=0.01,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
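The `min_df` and `max_df` settings control which terms survive into the vocabulary. The sketch below illustrates the effect on a small hypothetical corpus. It uses an absolute count for `min_df` (an integer) rather than the fraction used above, purely so the effect is visible on four documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the project's data.
docs = [
    "hong kong news today",
    "hong kong protest news",
    "zelda game news",
    "zombie game review",
]

# min_df=2 (an absolute count) keeps terms that appear in at least 2 documents;
# max_df=0.7 (a fraction) drops terms that appear in more than 70% of documents.
toy = TfidfVectorizer(min_df=2, max_df=0.7, ngram_range=(1, 2))
toy.fit(docs)

# "news" appears in 3 of 4 documents and is dropped by max_df;
# terms seen in only one document are dropped by min_df.
print(sorted(toy.vocabulary_))
```

With float values such as `min_df=0.01` and `max_df=.99`, the same thresholds are applied as fractions of the number of documents.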

## Transform the training data

In [11]:
cv_trans = cv.transform(df.text.values)

## Convert the sparse matrix with todense() and put it into a DataFrame

In [12]:
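# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
cv_df = pd.DataFrame(cv_trans.todense(), columns=cv.get_feature_names())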
cv_df.shape

(3611, 4716)

In [13]:
cv_df.head()

Unnamed: 0,00,000,01,02,03,04,05,06,07,08,...,youtubers,yr,yr old,yt,zealand,zelda,zero,zombie,zone,zoom
0,0.0,0.173334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.040379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.049908,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060965,0.0,0.0


# Transform the test data


In [14]:
test = """Former U.K. Consulate Staffer In Hong Kong Says He Was Tortured In Mainland China """

cv_test = cv.transform([test])
pd.DataFrame(cv_test.todense(), columns=cv.get_feature_names()).head()

Unnamed: 0,00,000,01,02,03,04,05,06,07,08,...,youtubers,yr,yr old,yt,zealand,zelda,zero,zombie,zone,zoom
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
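The truncated view above shows only zeros because the few non-zero columns fall outside the displayed range. One way to check which vocabulary terms the test text actually hit is to read the non-zero column indices off the sparse row. This is a sketch on a hypothetical toy corpus; `vec`, `row`, and `index_to_term` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(["hong kong news", "video game reviews"])  # hypothetical toy corpus

row = vec.transform(["Hong Kong game show"])  # sparse matrix with one row

# vocabulary_ maps each term to its column index; invert it to read columns back as terms.
index_to_term = {col: term for term, col in vec.vocabulary_.items()}

# row.nonzero() returns (row_indices, column_indices) of the stored non-zero entries.
nonzero_terms = sorted(index_to_term[col] for col in row.nonzero()[1])
print(nonzero_terms)
```

Words that are not in the fitted vocabulary ("show" in this toy case) are silently ignored by `transform`.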


# Use scikit-learn's cosine similarity to find the best matching subreddits

The index of each result matches a unique subreddit in our NoSQL database.

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

dist_matrix = cosine_similarity(cv_df, cv_test.todense())

results = pd.DataFrame(dist_matrix)
results[0].sort_values(ascending=False)[0:10]

224     0.638370
1789    0.636266
1485    0.601441
18      0.582280
646     0.440852
5       0.239219
1166    0.210401
3228    0.209769
1698    0.205634
737     0.200523
Name: 0, dtype: float64
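The same ranking can be computed without densifying anything, since `cosine_similarity` accepts sparse input. The sketch below shows this on a hypothetical toy corpus and picks the top matches with `numpy.argsort`. The names `corpus`, `vec`, and `top_k` are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the subreddit texts.
corpus = [
    "hong kong news and politics",
    "video game reviews",
    "china news and discussion",
    "cooking recipes",
]
vec = TfidfVectorizer()
corpus_tfidf = vec.fit_transform(corpus)         # sparse matrix, one row per document
query_tfidf = vec.transform(["hong kong news"])  # sparse matrix with a single row

# cosine_similarity accepts sparse input, so no todense() call is needed.
scores = cosine_similarity(corpus_tfidf, query_tfidf).ravel()

# Indices of the k highest scores, best match first.
k = 2
top_k = np.argsort(scores)[::-1][:k]
for idx in top_k:
    print(idx, round(float(scores[idx]), 3))
```

Keeping the matrices sparse avoids allocating a full documents-by-terms array, which may matter when memory at deployment is tight.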

## Check the results

The results seem to be very relevant to our test data. This is a good sign.

In [20]:
import pymongo
secret_code = ""  # MongoDB connection URI (left blank here)

best_results = [224, 1485, 18, 646]


def get_data(array):
    # Look up one document per sub_id; [0] takes the first match for each id.
    client = pymongo.MongoClient(secret_code)
    db = client.sfw_db
    data = [db.sfw_db.find({'sub_id':int(num)})[0] for num in array]
    return data

get = get_data(best_results)

for i in get:
    print(i['name'])
    print(i['title'])
    print('----------')

Hong_Kong
The community for discussion of news, politics, current events, history in Hong Kong, China.
----------
Sino
Sino: News, Information, Discussion on all things China and Chinese Related
----------
HongKong
香港
----------
LIHKG
LIHKG
----------


# Export the fitted model and the transformed training data

In [21]:
import joblib

joblib.dump(cv, 'cd_model.joblib')
joblib.dump(cv_trans, 'transformed_data.joblib')

['transformed_data.joblib']
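At serving time, the two exported files should be enough to score new text without access to the raw training data. The following is a sketch of that round trip under stated assumptions: it uses a hypothetical toy corpus and temporary file names so that it runs on its own, rather than the files written above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data so the round trip can run on its own.
texts = ["hong kong news", "video game reviews", "cooking recipes"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.joblib")
    data_path = os.path.join(tmp, "data.joblib")
    joblib.dump(vectorizer, model_path)
    joblib.dump(matrix, data_path)

    # Only these two files are needed to answer queries later.
    loaded_vectorizer = joblib.load(model_path)
    loaded_matrix = joblib.load(data_path)

# Vectorize a new query with the loaded model and rank the stored rows against it.
query = loaded_vectorizer.transform(["news from hong kong"])
scores = cosine_similarity(loaded_matrix, query).ravel()
best = int(scores.argmax())
print(best, texts[best])
```

The index returned here plays the same role as the row index used for the database lookup earlier in the notebook.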