# Text Mining & Text Analytics

### *Objective*:
To apply text mining techniques to document classification. You will train a machine learning model to distinguish between two types of Reddit posts: those about Data Science and those about Game of Thrones. The goal is to explore how text mining can categorize documents and to gain insight into real-world applications such as spam filtering, sentiment analysis, and topic detection.
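
As a rough illustration of the objective, here is a minimal sketch of a two-class text classifier. The six toy posts, the TF-IDF features, and the choice of `LogisticRegression` are assumptions made for this example only; they are not the dataset or the final model used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy posts standing in for the real Reddit data
texts = [
    "python pandas regression model",
    "machine learning model training data",
    "statistics regression data analysis",
    "dragon queen throne winter",
    "stark lannister throne battle",
    "winter wall dragon battle",
]
labels = ["data science"] * 3 + ["game of thrones"] * 3

# Turn each post into TF-IDF features, then fit a linear classifier on them
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify two unseen posts
predictions = model.predict(["regression model data", "dragon throne battle"])
print(list(predictions))
```

The same pattern (vectorize, fit, predict) should carry over to the real posts once the combined dataset is built.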

In [1]:
# Install third-party dependencies first, so the imports below can succeed
%pip install -q wordcloud nltk seaborn matplotlib pandas numpy scikit-learn
# Libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import random
import string
import re
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')



Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package stopwords to /home/roxel/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/roxel/nltk_data...
[nltk_data] Downloading package punkt to /home/roxel/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /home/roxel/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Since I couldn't find an existing dataset with Reddit posts about both Data Science and GOT, I decided to make things a little more interesting and create my own by combining two existing datasets of Reddit posts, one per topic.

In [4]:
# %pip install kagglehub
import kagglehub

# Download the datasets from Kaggle
path_a = kagglehub.dataset_download("nikhilkhetan/game-of-thrones")
print("Path to GOT dataset:", path_a)

path_b = kagglehub.dataset_download("maksymshkliarevskyi/reddit-data-science-posts")
print("Path to DataSci dataset:", path_b)

Note: you may need to restart the kernel to use updated packages.
Downloading from https://www.kaggle.com/api/v1/datasets/download/nikhilkhetan/game-of-thrones?dataset_version_number=1...


100%|██████████| 339k/339k [00:00<00:00, 428kB/s]

Extracting files...





Path to GOT dataset: /home/roxel/.cache/kagglehub/datasets/nikhilkhetan/game-of-thrones/versions/1
Downloading from https://www.kaggle.com/api/v1/datasets/download/maksymshkliarevskyi/reddit-data-science-posts?dataset_version_number=4...


100%|██████████| 114M/114M [00:32<00:00, 3.70MB/s] 

Extracting files...





Path to DataSci dataset: /home/roxel/.cache/kagglehub/datasets/maksymshkliarevskyi/reddit-data-science-posts/versions/4


In [5]:
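# Load the datasets from local copies of the CSVs (path_a and path_b are reassigned here, replacing the kagglehub cache paths above)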
path_a = '../data/GameofThrones.csv'
path_b = '../data/reddit_database.csv'

got_df = pd.read_csv(path_a)
ds_df = pd.read_csv(path_b)

In [6]:
# inspecting columns to merge both datasets
print(ds_df.columns)
print(got_df.columns)


Index(['created_date', 'created_timestamp', 'subreddit', 'title', 'id',
       'author', 'author_created_utc', 'full_link', 'score', 'num_comments',
       'num_crossposts', 'subreddit_subscribers', 'post'],
      dtype='object')
Index(['title', 'score', 'id', 'url', 'comms_num', 'created', 'body',
       'timestamp'],
      dtype='object')


Matching columns: `title`, `id`, and `score` have the same name in both datasets, and `post` (Data Science) corresponds to `body` (GOT).
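
The column mapping above can be sketched on toy data. The two small frames below are hypothetical stand-ins for the real CSVs, and renaming `body` to `post` is just one possible way to align the schemas; the notebook itself builds `text` directly from each frame's own columns.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets (not the real Kaggle files)
ds_toy = pd.DataFrame({
    "title": ["Best way to learn pandas?", None],
    "post": ["I am new to Python.", "Only a body here"],
})
got_toy = pd.DataFrame({
    "title": ["Season 8 thoughts", "Favourite house?"],
    "body": ["Spoilers ahead", None],
})

# Align the GOT schema with the Data Science one: body -> post
got_toy = got_toy.rename(columns={"body": "post"})

# Build one text field per row; missing titles or bodies become empty strings
for frame in (ds_toy, got_toy):
    combined = frame["title"].fillna("") + " " + frame["post"].fillna("")
    frame["text"] = combined.str.strip()

print(ds_toy["text"].tolist())
print(got_toy["text"].tolist())
```

Filling missing values before concatenating matters: adding a string to `NaN` would otherwise yield `NaN` for the whole row.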

In [7]:
ds_df['text'] = ds_df['title'].fillna('') + ' ' + ds_df['post'].fillna('')
got_df['text'] = got_df['title'].fillna('') + ' ' + got_df['body'].fillna('')


In [8]:
# Drop rows whose combined text is empty or whitespace-only
ds_df = ds_df[ds_df['text'].str.strip() != '']
got_df = got_df[got_df['text'].str.strip() != '']


In [10]:
n = min(len(ds_df), len(got_df), 5000)  # Use at most 5000 samples per class, or fewer if either dataset is smaller
ds_sample = ds_df.sample(n=n, random_state=42)[['text']].copy()
got_sample = got_df.sample(n=n, random_state=42)[['text']].copy()

In [11]:
ds_sample['category'] = 'data science'
got_sample['category'] = 'game of thrones'

In [None]:
base = pd.concat([ds_sample, got_sample], ignore_index=True)
base = base.sample(frac=1, random_state=42).reset_index(drop=True)
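
The sampling, labelling, and shuffling steps above can be checked on toy data. This is a minimal sketch: the frames `ds_toy` and `got_toy` and the cap of 6 are hypothetical values chosen so the result is easy to inspect, standing in for the real datasets and the cap of 5000.

```python
import pandas as pd

# Hypothetical corpora of unequal size
ds_toy = pd.DataFrame({"text": [f"ds post {i}" for i in range(8)]})
got_toy = pd.DataFrame({"text": [f"got post {i}" for i in range(5)]})

cap = 6  # plays the role of the 5000 cap
n = min(len(ds_toy), len(got_toy), cap)  # limited by the smaller corpus

# Draw the same number of rows from each corpus and attach the class label
ds_part = ds_toy.sample(n=n, random_state=42).assign(category="data science")
got_part = got_toy.sample(n=n, random_state=42).assign(category="game of thrones")

# Concatenate, then shuffle so the two classes are interleaved
balanced = pd.concat([ds_part, got_part], ignore_index=True)
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

counts = balanced["category"].value_counts()
print(counts)
```

Taking the minimum of the two lengths keeps the classes balanced, which makes plain accuracy a more meaningful metric later on.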

In [None]:
base.to_csv('../data/reddit_posts.csv', index=False)

In [None]:
# Reload the combined dataset from disk (not needed in this session; kept for future reference)
# df = pd.read_csv('../data/reddit_posts.csv')