<center><h1>Reddit Data Analysis(EDA)-v1</h1></center>
<hr>

## What is this about..

Explore the Reddit Dataset.

## Thanks to.. <a id="top"></a>

> **Kaggle Data**<br>
> 
> [Reddit - Data is Beautiful](https://www.kaggle.com/unanimad/dataisbeautiful)<br>

> **Questions**<br>
> 
> [tqdm: Using progress bar in pandas apply function](https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations)<br>
> [nltk: pos_tag + lemmatization](https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python)<br>
> [wordcloud: how to draw wordcloud](https://lovit.github.io/nlp/2018/04/17/word_cloud/)<br>
> [scikit-learn: CountVectorizer get feature names](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)<br>

## Table of Contents.. <a id="top"></a>

1. [Problem Description](#1)
2. [Data Description](#2)
3. [Environment Setting](#3)
    1. [Import Library](#3.1)
    2. [Load Dataset](#3.2)
4. [Data Preprocessing](#4)
    1. [Missing Value](#4.1)
    2. [Time Management](#4.2)
    3. [OC(Original Content)](#4.3)
    4. [Text Cleaning](#4.4)
    5. [Data Type Conversion](#4.5)
5. [Exploratory Data Analysis(EDA)](#5)
    1. [Distribution of Numerical Value](#5.1)
    2. [Time Analysis](#5.2)
    3. [Wordcloud Text Analysis](#5.3)
6. [Word Embedding](#6)
    1. [Count Vectorizer](#6.1)
    2. [TF-IDF Vectorizer](#6.2)
7. [Data Modeling](#7)
    1. [Dimension Reduction](#7.1)
    2. [Classification](#7.2)
    3. [Topic Modeling](#7.3)

<hr>

# 1. Problem Description <a id="1"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

Perform exploratory data analysis on the Reddit dataset: pose questions about the posts and check them with data visualization.

# 2. Data Description <a id="2"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

Reddit Data is Beautiful - from [Kaggle](https://www.kaggle.com/unanimad/dataisbeautiful)
> **About**<br>
> 
> Data is Beautiful, r/dataisbeautiful, is a place for visual representations of data: Graphs, charts, maps, etc. DataIsBeautiful is for visualizations that effectively convey information. Aesthetics are an important part of information visualization, but pretty pictures are not the aim of this subreddit.

> **Content**<br>
> 
> This dataset contains the following fields, based on Reddit post submissions:
> * id
> * title
> * score
> * author
> * author_flair_text
> * removed_by
> * total_awards_received
> * awarders
> * created_utc
> * full_link
> * num_comments
> * over_18

> **Method**<br>
> 
> The data was extracted using the PushShift API for Reddit. Thanks to Watchful1 for showing me this API.

# 3. Environment Setting<a id="3"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

## 3.1. Import Library<a id="3.1"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
# Image
from PIL import Image

# Python Collection
from collections import Counter

# FOR Loop Verbose
from tqdm import tqdm

# System
import os

# String
import string

# Natural Language Processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

# Date and Time
import datetime

# Dataframe
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Numerical Data
import numpy as np

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
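data_path = '/usr/share/nltk_data'
print(data_path)
if not os.path.exists(data_path):
    # nltk.download() with no arguments opens the interactive downloader,
    # so fetch only the resources used here (names may differ across NLTK versions).
    for package in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger']:
        nltk.download(package)
nltk.data.path.append(data_path)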

In [None]:
pd.options.display.max_rows = 499
pd.options.display.max_columns = 499
pd.options.mode.chained_assignment = None

In [None]:
%matplotlib inline

## 3.2. Load Dataset<a id="3.2"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
raw = pd.read_csv('/kaggle/input/dataisbeautiful/r_dataisbeautiful_posts.csv', encoding='utf-8')
raw

In [None]:
raw.info()

In [None]:
raw.describe(include='all')

In [None]:
raw.isna().sum()

# 4. Data Preprocessing<a id="4"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

## 4.1. Missing Value<a id="4.1"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
cleaned = raw.copy()

**title**<br>
Fill missing titles with the placeholder string 'null'.

In [None]:
cleaned.title = cleaned.title.fillna('null')

In [None]:
cleaned[cleaned.title == 'null']

**author_flair_text, removed_by, total_awards_received, awarders**<br>
Drop these columns from the dataset.

In [None]:
columns = ['author_flair_text', 'removed_by', 'total_awards_received', 'awarders']
cleaned = cleaned.drop(columns, axis=1)
cleaned

In [None]:
cleaned.isna().sum()

## 4.2. Time Management<a id="4.2"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def utc_to_datetime(data):
    # created_utc holds epoch seconds. Parse it once as UTC, then split into parts.
    # (datetime.fromtimestamp would use the machine's local timezone instead.)
    stamp = pd.to_datetime(data['created_utc'], unit='s')
    data['year'] = stamp.dt.year
    data['month'] = stamp.dt.month
    data['day'] = stamp.dt.day
    data['hour'] = stamp.dt.hour
    data['minute'] = stamp.dt.minute
    data['second'] = stamp.dt.second

In [None]:
utc_to_datetime(cleaned)
cleaned
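
The calendar day and hour assigned to a post depend on the timezone used to interpret `created_utc`. The sketch below uses only the standard library and a made-up timestamp (not taken from the dataset) to show the explicit UTC reading next to the local-time reading.

```python
import datetime

# Hypothetical epoch value: 2020-04-01 00:00:00 UTC.
timestamp = 1585699200

# Explicit UTC: the same result on every machine.
as_utc = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)

# No tz argument: interpreted in the machine's local timezone,
# so the day and hour can shift depending on where the code runs.
as_local = datetime.datetime.fromtimestamp(timestamp)

print('UTC  :', as_utc.isoformat())
print('Local:', as_local.isoformat())
```

On a machine whose clock is set to UTC (as is assumed to be common for hosted notebooks), both lines show the same wall-clock time.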

## 4.3. OC(Original Content)<a id="4.3"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
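# regex=False matches the literal text '[OC]'; as a regex, '[OC]' would match any 'O' or 'C'.
cleaned['original_content'] = cleaned['title'].str.contains('[OC]', regex=False).astype(int)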
cleaned
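
`str.contains` treats its pattern as a regular expression by default, so the brackets in `[OC]` need care. A minimal sketch on made-up titles (not from the dataset) comparing the regex reading with the literal reading:

```python
import pandas as pd

# Hypothetical sample titles for illustration only.
titles = pd.Series([
    '[OC] Rainfall by month',
    'Coffee prices over time',
    'Old chart [oc]',
])

# As a regex, '[OC]' is a character class: any title containing 'O' or 'C'.
as_regex = titles.str.contains('[OC]')

# With regex=False the pattern is the literal four characters '[OC]'.
as_literal = titles.str.contains('[OC]', regex=False)

print(as_regex.tolist())    # [True, True, True]
print(as_literal.tolist())  # [True, False, False]
```

The literal match is case-sensitive, so the lowercase `[oc]` tag in the third title is not counted.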

## 4.4. Text Cleaning<a id="4.4"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
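def get_wordnet_tag(tag):
    # pos_tag returns Penn Treebank tags by default (JJ*, VB*, NN*, RB*).
    # Map them to the WordNet POS codes: 'a' (adjective), 'v', 'n', 'r' (adverb).
    if tag.startswith('J'):
        return 'a'
    elif tag.startswith('V'):
        return 'v'
    elif tag.startswith('N'):
        return 'n'
    elif tag.startswith('R'):
        return 'r'
    else:
        return 'n'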

In [None]:
def lemmatize_text(title):
    stop = set(stopwords.words('english'))  # set gives fast membership tests
    lemmatizer = WordNetLemmatizer()

    words = list()
    title = word_tokenize(title)
    for word, tag in pos_tag(title):
        tag = get_wordnet_tag(tag)
        word = lemmatizer.lemmatize(word, tag)
        if word not in stop:
            words.append(word)
    
    return ' '.join(words)        

In [None]:
def clean_text(dataset):
    
    tqdm.pandas()
    
    dataset['title_cleaned'] = dataset['title'].str.lower()
    # regex=True is passed explicitly: pandas >= 2.0 treats str.replace patterns as literal by default.
    dataset['title_cleaned'] = dataset['title_cleaned'].str.replace(r'\[oc\]', ' ', regex=True)
    pattern_link = r'https?://[^\s]+|www\.[^\s]+|[^\s]+\.com[^\s]*|[^\s]+\.org[^\s]*|[^\s]+\.html[^\s]*'
    dataset['title_cleaned'] = dataset['title_cleaned'].str.replace(pattern_link, ' link ', regex=True)

    pattern_punctuation = r'[' + string.punctuation + '’]'
    dataset['title_cleaned'] = dataset['title_cleaned'].str.replace(pattern_punctuation, '', regex=True)
    dataset['title_cleaned'] = dataset['title_cleaned'].str.replace(r' [\d]+ |^[\d]+ | [\d]+$', ' ', regex=True)
    dataset['title_cleaned'] = dataset['title_cleaned'].str.replace(r'[^\w\d\s]+', ' ', regex=True)
    dataset['title_cleaned'] = dataset['title_cleaned'].progress_apply(lambda title: lemmatize_text(title))

    dataset['title_cleaned'] = dataset['title_cleaned'].str.replace(r'\s[\s]+', ' ', regex=True)

In [None]:
clean_text(cleaned)
cleaned
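
To see what the regex steps do to a single title, here is a simplified standard-library sketch. `basic_clean` is a hypothetical helper: it skips lemmatization and stop-word removal, uses a shorter link pattern, and adds a final `strip()`, so it only approximates `clean_text`.

```python
import re
import string

def basic_clean(title):
    """Lowercase, drop the [OC] tag, replace links, strip punctuation, squeeze spaces."""
    text = title.lower()
    text = re.sub(r'\[oc\]', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' link ', text)
    # re.escape keeps characters such as ']' and '\' literal inside the character class.
    text = re.sub('[' + re.escape(string.punctuation) + '’]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "[OC] Reddit's top posts, see https://example.com/chart"
cleaned_sample = basic_clean(sample)
print(cleaned_sample)  # reddits top posts see link
```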

## 4.5. Data Type Conversion<a id="4.5"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
cleaned['over_18'] = cleaned['over_18'].astype(int)
cleaned

# 5. Exploratory Data Analysis(EDA)<a id="5"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

## 5.1. Distribution of Numerical Value<a id="5.1"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def boxplot(data, feature, base):
    assert base in ['over_18', 'original_content']
    
    plt.figure(figsize=(30, 12))
    sns.boxplot(x=base, y=feature, data=data)
    plt.show()

In [None]:
cleaned['score'].value_counts()

In [None]:
boxplot(cleaned, 'score', 'over_18')

In [None]:
boxplot(cleaned, 'score', 'original_content')

In [None]:
cleaned['num_comments'].value_counts()

In [None]:
boxplot(cleaned, 'num_comments', 'over_18')

In [None]:
boxplot(cleaned, 'num_comments', 'original_content')

## 5.2. Time Analysis<a id="5.2"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def countplot(data, by='year'):
    assert by in ['year', 'month', 'day']
    data_copy = data.copy()
    data_copy['year'] = data_copy['year'].astype(str)
    data_copy['month'] = data_copy['month'].astype(str)
    data_copy['day'] = data_copy['day'].astype(str)

    plt.figure(figsize=(30, 10))
    # Pass the column as x=...; recent seaborn versions reject it as a positional argument.
    if by == 'year':
        stat = data_copy['year'].value_counts()
        sns.countplot(x=by, data=data_copy)
        plt.xlabel(by)
    elif by == 'month':
        data_copy['month'] = data_copy['year'] + '/' + data_copy['month']
        stat = data_copy['month'].value_counts()
        sns.countplot(x=by, data=data_copy)
        plt.xlabel(by)
        plt.xticks(rotation=45)
    elif by == 'day':
        data_copy['day'] = data_copy['year'] + '/' + data_copy['month'] + '/' + data_copy['day']
        stat = data_copy['day'].value_counts()
        sns.countplot(x=by, data=data_copy)
        
    plt.ylabel('count')
    plt.title('Count by Year/Month/Day Recent to Old')
    plt.show()
    
    return stat
    

In [None]:
countplot(cleaned, 'year')

In [None]:
countplot(cleaned, 'month')

In [None]:
countplot(cleaned, 'day')

> What happened on April 1st and 2nd?

## 5.3. Wordcloud Text Analysis<a id="5.3"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def wordcloud(dataset, min_freq=1):
    bow = list()
    for title in tqdm(dataset['title_cleaned']):
        bow += word_tokenize(title)
    
    word_freq = dict()
    counter = Counter(bow)
    for word, freq in counter.items():
        if freq >= min_freq:
            word_freq[word] = freq
    
#     reddit_mask = np.array(Image.open('/kaggle/working/reddit_icon.png'))    
    
    wc = WordCloud(width=800, height=800, background_color='white') #, mask=reddit_mask)
    wc = wc.generate_from_frequencies(word_freq)
    
    plt.figure(figsize=(12, 12))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

    
    return counter, word_freq

In [None]:
counter, word_freq = wordcloud(cleaned)
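
The frequency table behind the word cloud is a plain `Counter` filtered by a minimum count. A small sketch with made-up tokens (the variable names here are illustrative, not from the notebook):

```python
from collections import Counter

# Hypothetical tokens standing in for the tokenized titles.
tokens = 'data chart data map chart data'.split()

token_counter = Counter(tokens)

# Keep only words that appear at least min_freq times.
min_freq = 2
frequent = {word: freq for word, freq in token_counter.items() if freq >= min_freq}

print(token_counter.most_common(2))  # [('data', 3), ('chart', 2)]
print(frequent)                      # {'data': 3, 'chart': 2}
```

`most_common` is also a quick way to inspect the top words without drawing the cloud.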

In [None]:
def wordcloud_by_date(dataset, year=None, month=None, day=None):
    dataset_cp = dataset.copy()
    
    if year:
        dataset_cp = dataset_cp[dataset_cp['year'] == year]
    if month:
        dataset_cp = dataset_cp[dataset_cp['month'] == month]
    if day:
        dataset_cp = dataset_cp[dataset_cp['day'] == day]
    
    return wordcloud(dataset_cp)

In [None]:
counter_2020, word_freq_2020 = wordcloud_by_date(cleaned, year=2020)

In [None]:
counter_20170401, word_freq_20170401 = wordcloud_by_date(cleaned, year=2017, month=4, day=1)

In [None]:
counter_20170402, word_freq_20170402 = wordcloud_by_date(cleaned, year=2017, month=4, day=2)

In [None]:
counter_201704, word_freq_201704 = wordcloud_by_date(cleaned, year=2017, month=4)

In [None]:
counter_04, word_freq_04 = wordcloud_by_date(cleaned, month=4)

> What happened on April 1st and 2nd? -> DATAIRL (DATA In Real Life)

> Coronavirus-related terms are overwhelmingly frequent.

# 6. Word Embedding<a id="6"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

## 6.1. Count Vectorizer<a id="6.1"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def count_vectorize(dataset):
    
    vectorizer = CountVectorizer()
    
    documents = list()
    for title in tqdm(dataset['title_cleaned']):
        documents.append(title)
    document_vector = vectorizer.fit_transform(documents)
    return vectorizer, document_vector

In [None]:
cv, cv_encoded = count_vectorize(cleaned)

In [None]:
cv_encoded.shape

In [None]:
cv_features = cv.get_feature_names_out()  # scikit-learn >= 1.0; older versions use get_feature_names()
rows, cols = cv_encoded.nonzero()
for i, j in zip(rows[:30], cols[:30]):
    print('({:4}, {:8}({:15})) -> {}'.format(i, j, cv_features[j], cv_encoded[i, j]))
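
A toy example may make the encoded matrix easier to read. The sketch below runs `CountVectorizer` with default settings on three made-up documents: each row is a document, each column a vocabulary word, and each cell a raw count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus for illustration only.
docs = ['data is beautiful', 'beautiful chart', 'data data chart']

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)             # ['beautiful', 'chart', 'data', 'is']
print(toy_matrix.toarray())  # rows: [1 0 1 1], [1 1 0 0], [0 1 2 0]
```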

## 6.2. TF-IDF Vectorizer<a id="6.2"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def tfidf_vectorize(dataset):
    vectorizer = TfidfVectorizer()
    
    documents = list()
    for title in tqdm(dataset['title_cleaned']):
        documents.append(title)
    document_vector = vectorizer.fit_transform(documents)
    return vectorizer, document_vector

In [None]:
tfidf, tfidf_encoded = tfidf_vectorize(cleaned)

In [None]:
tfidf_encoded.shape

In [None]:
tfidf_features = tfidf.get_feature_names_out()  # scikit-learn >= 1.0; older versions use get_feature_names()
rows, cols = tfidf_encoded.nonzero()
for i, j in zip(rows[:30], cols[:30]):
    print('({:4}, {:8}({:15})) -> {}'.format(i, j, tfidf_features[j], tfidf_encoded[i, j]))
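
For reference, the weights above can be reproduced by hand. The sketch assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) and uses a made-up mini-corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus for illustration only.
docs = ['data is beautiful', 'beautiful chart', 'data data chart']

# Term counts per document (rows) and word (columns).
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, as used by the scikit-learn defaults.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Words that occur in fewer documents get a larger IDF, so they weigh more than words shared by most titles.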

# 7. Data Modeling<a id="7"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

## 7.1. Dimension Reduction<a id="7.1"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
oc = cleaned[cleaned['original_content'] == 1]
noc = cleaned[cleaned['original_content'] == 0]

In [None]:
def scatter(data, x):
    oc = data[data['original_content'] == 1]
    noc = data[data['original_content'] == 0]

    plt.figure(figsize=(10, 10))
    plt.scatter(x[oc.index, 0], x[oc.index, 1], color='red', label='Original Content')
    plt.scatter(x[noc.index, 0], x[noc.index, 1], color='blue', label='Not Original Content')
    plt.legend()
    plt.title('Sample 2-Dimension Features')
    plt.show()

**Truncated SVD(for Sparse Data)**

In [None]:
def svd(encoded, dimension=50):
    svd = TruncatedSVD(n_components=dimension, n_iter=10, random_state=2020)
    reduced = svd.fit_transform(encoded)
    return svd, reduced

In [None]:
svd50, svd50_reduced = svd(tfidf_encoded)

In [None]:
svd50_reduced.shape

In [None]:
scatter(cleaned, svd50_reduced)
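
One way to judge whether the chosen number of components is enough is the fitted model's `explained_variance_ratio_`. A minimal sketch on a random sparse matrix (a stand-in for the TF-IDF matrix, with arbitrary sizes):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse input: 100 "documents" x 40 "terms", about 10% non-zero.
toy_sparse = sparse_random(100, 40, density=0.1, format='csr', random_state=2020)

toy_svd = TruncatedSVD(n_components=5, random_state=2020)
toy_reduced = toy_svd.fit_transform(toy_sparse)

# Fraction of the total variance kept by the 5 components together.
kept = toy_svd.explained_variance_ratio_.sum()

print(toy_reduced.shape)  # (100, 5)
print('variance kept: {:.3f}'.format(kept))
```

The same attribute is available on the fitted `svd50` object in this notebook.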

**T-SNE**<br>
t-SNE takes far too long to run on the full dataset, so the calls below are left commented out.

In [None]:
def tsne(encoded, dimension=2):
    tsne = TSNE(n_components=dimension, verbose=5, random_state=2020, n_jobs=4)
    reduced = tsne.fit_transform(encoded)
    return tsne, reduced

In [None]:
# tsne2, tsne2_reduced = tsne(svd50_reduced)

In [None]:
# tsne2_reduced.shape

In [None]:
# scatter(cleaned, tsne2_reduced)

## 7.2. Classification<a id="7.2"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

Is it possible to discriminate Original Content from non-original content by title alone? -> About 60% accuracy; the model may need tuning.

In [None]:
X = svd50_reduced
Y = np.array(cleaned['original_content'])

In [None]:
X.shape

In [None]:
Y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    X, Y,
    test_size=0.2,
    stratify=Y,
    random_state=2020  # fixed seed so the split is reproducible
)
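
The `stratify` argument keeps the class ratio the same in the train and test parts. A small sketch on made-up labels (80 zeros and 20 ones) illustrating this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20% positives.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    stratify=toy_y,
    random_state=2020,
)

# Both parts keep the 20% share of positives.
print(toy_y_train.mean(), toy_y_test.mean())
```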

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
def gradient_boosting_model(x_train, y_train, x_test, y_test):
    model = GradientBoostingClassifier(learning_rate=0.01, n_estimators=100, random_state=2020, verbose=1)
    scores = cross_validate(model, x_train, y_train, scoring='accuracy', cv=2, return_train_score=True, verbose=1)

    model.fit(x_train, y_train)
    acc = accuracy_score(y_test, model.predict(x_test))  # signature is (y_true, y_pred)
    
    return model, scores, acc

In [None]:
gb_model, gb_scores, gb_acc = gradient_boosting_model(x_train, y_train, x_test, y_test)

In [None]:
print('Validation Accuracies: {}'.format(gb_scores['test_score']))
print('Test Accuracy: {}'.format(gb_acc))
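
To put an accuracy figure in context, it helps to compare it with the score obtained by always predicting the most frequent class. `majority_baseline_accuracy` below is a hypothetical helper, not part of the notebook, shown on made-up labels.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size

# Three of the four made-up labels are 1, so always guessing 1 is right 75% of the time.
baseline = majority_baseline_accuracy([1, 1, 1, 0])
print(baseline)  # 0.75
```

Calling it on `y_test` gives the number a trained model has to beat.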

## 7.3. Topic Modeling<a id="7.3"></a>
<p style="text-align:right;"><a href="#top">🔝 top</a></p>

In [None]:
def lda(encoded, n_topic=10):
    lda = LatentDirichletAllocation(n_components=n_topic, verbose=1, random_state=2020)
    lda.fit(encoded)
    return lda

In [None]:
lda10 = lda(tfidf_encoded, n_topic=10)

In [None]:
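# Fetch the vocabulary once (scikit-learn >= 1.0; older versions use get_feature_names()).
topic_features = tfidf.get_feature_names_out().tolist()
for idx, topic in enumerate(lda10.components_):
    words = [topic_features[topic_id] for topic_id in topic.argsort()[::-1][:10]]
    print('Topic {:2d} -> {}'.format(idx, words))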
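
The top-word lookup relies on `argsort`: sort the word weights of a topic, reverse the order, and keep the first few indices. A tiny sketch with made-up weights and words:

```python
import numpy as np

# Hypothetical vocabulary and per-word weights for a single topic.
toy_words = np.array(['chart', 'data', 'map', 'year'])
toy_weights = np.array([0.1, 2.5, 0.7, 1.3])

# argsort is ascending, so reverse it and take the two largest weights.
top_indices = toy_weights.argsort()[::-1][:2]
top_words = toy_words[top_indices].tolist()

print(top_words)  # ['data', 'year']
```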

In [None]:
topic_df = cleaned.copy()
# Transform all titles in one call instead of one row at a time, which is much faster.
topics = lda10.transform(tfidf_encoded)
topic_df['topic'] = topics.argmax(axis=1)       # most probable topic per title
topic_df['topic_value'] = topics.max(axis=1)    # probability of that topic

# Print a sample every 30000 rows, plus the last row.
length = topic_df.shape[0]
for idx in sorted(set(range(0, length, 30000)) | {length - 1}):
    row = topic_df.iloc[idx]
    print('Topic {:2d}({:.6f}) {:}'.format(int(row['topic']), row['topic_value'], row['title_cleaned']))
topic_df

In [None]:
topic_df['topic'] = topic_df['topic'].astype(int)

In [None]:
plt.figure(figsize=(12, 12))
sns.countplot(x=topic_df['topic'])
plt.xlabel('Topic')
plt.title('Topic Counter')
plt.show()

In [None]:
svd2, svd2_reduced = svd(tfidf_encoded, dimension=2)

In [None]:
svd2_reduced.shape

In [None]:
scatter(cleaned, svd2_reduced)

In [None]:
plt.figure(figsize=(12, 12))
for topic in sorted(topic_df['topic'].unique()):
    index = topic_df[topic_df['topic'] == topic].index
    sns.scatterplot(x=svd2_reduced[index, 0], y=svd2_reduced[index, 1], label=str(topic), s=100)
plt.legend()
plt.title('Topic Distribution in 2-d representation')
plt.show()