# Medium Articles EDA

In this kernel `intense EDA` is performed on the [Medium Articles](https://www.kaggle.com/hsankesara/medium-articles) dataset by [Hsankesara](https://www.kaggle.com/hsankesara). The dataset contains `articles`, their `title`, the `number of claps` each has received, their `links` and their `reading time`.

**While doing this we'll go through:**
- Preprocessing of text data
- Removing outliers using `IQR` and `z-score` methods
- Data visualization using `seaborn` and `word cloud`
- Building `classes` following the `DRY` convention

During `EDA` we'll use the preprocessed data to answer different questions.

![](https://media.giphy.com/media/5zf2M4HgjjWszLd4a5/giphy.gif)

In [None]:
import re
import math
import string
from random import randint

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import zscore
from nltk.corpus import stopwords
from wordcloud import STOPWORDS, WordCloud

from sklearn.preprocessing import MinMaxScaler

In [None]:
# Loading dataset
df = pd.read_csv('/kaggle/input/medium-articles/articles.csv')
df.sample(5)

In [None]:
df.info()

No missing data

`CustomFormatter` class will have helper functions to format strings, just for `extra touch` 🍷.

In [None]:
# Formatter to format anything
class CustomFormatter:
    def __init__(self):
        pass
    
    # convert number to K 
    # eg. 1,000 to 1K
    # can't think of any better name for this func
    @staticmethod
    def format_likes_number_to_str(number):
        rounded_num = round(number / 1000, 2)
        frac, whole = math.modf(rounded_num)
        frac = round(frac, 2) if frac != 0 else 0
        return f'{int(whole) + frac}K'

    
print(CustomFormatter.format_likes_number_to_str(5000))
print(CustomFormatter.format_likes_number_to_str(5200))

## Data Preparation

Here we are going to clean the data (the `data cleaning` process) and transform it into a form suitable for analysis (the `data wrangling` process).

> Data cleaning focuses on removing inaccurate data from your data set whereas data wrangling focuses on transforming the data's format, typically by converting “raw” data into another format more suitable for use.

Convert `claps` dtype from str to int.

In [None]:
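def convert_clap_dtype(clap_str):
    try:
        if 'K' in clap_str:
            # 32K & 3.2K (round instead of int to avoid float truncation)
            return round(float(clap_str.split('K')[0]) * 1000)
        # 32
        return int(clap_str)
    except ValueError:
        # anything that is neither a plain number nor a `K` value
        print(f'🌊 Anomaly: {clap_str}')
        return clap_str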


df.claps = df.claps.apply(convert_clap_dtype)
df.claps.values[:10].tolist()
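
The same conversion can also be written without a Python-level function. Below is a minimal sketch on a made-up Series (the sample values are assumptions, not rows from the dataset) that relies only on pandas string methods.

```python
import pandas as pd

# Hypothetical raw clap strings in the formats handled above
raw_claps = pd.Series(['32', '3.2K', '32K', '850'])

# Mark the rows that carry the `K` suffix
has_k = raw_claps.str.endswith('K')

# Strip the suffix and parse everything as float
numbers = raw_claps.str.rstrip('K').astype(float)

# Keep plain numbers as they are, scale the `K` rows by 1,000
claps = numbers.where(~has_k, numbers * 1000).round().astype(int)

print(claps.tolist())
```

This version does no anomaly reporting: `astype(float)` simply raises on a value it cannot parse.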

Creating a `domain` column that holds the domain of each article's `link`. If you want, you can use the links to scrape more data for further analysis.

In [None]:
def extract_domain(link):
    return link.split('https://')[1].split('/')[0]


df['domain'] = df.link.apply(extract_domain)
df.domain.values[:10].tolist()
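
`extract_domain` assumes every link starts with `https://`. As a hedged alternative, the sketch below uses `urllib.parse.urlparse` from the standard library, which also copes with `http://` links. The function name and the sample links are made up for illustration.

```python
from urllib.parse import urlparse


def extract_domain_with_urlparse(link):
    # netloc is the host part of the URL, e.g. 'medium.com'
    return urlparse(link).netloc


# Hypothetical links, not taken from the dataset
sample_links = [
    'https://medium.com/some-publication/some-post-123',
    'http://blog.example.com/post?source=feed',
]

domains = [extract_domain_with_urlparse(link) for link in sample_links]
print(domains)
```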

In [None]:
# Remove punctuation from word
def rm_punc_from_word(word):
    clean_alphabet_list = [
        alphabet for alphabet in word if alphabet not in string.punctuation]
    return ''.join(clean_alphabet_list)


print(rm_punc_from_word('#cool!'))

In [None]:
# Remove punctuation from text
def rm_punc_from_text(text):
    # iterating over a str yields characters, so each one is filtered individually
    clean_char_list = [rm_punc_from_word(char) for char in text]
    return ''.join(clean_char_list)


print(rm_punc_from_text("Frankly, my dear, I don't give a damn"))
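
Punctuation can also be stripped in a single pass with `str.translate`. This is only a sketch of an alternative (the helper name is made up), not a replacement for the functions above.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def rm_punc_with_translate(text):
    return text.translate(PUNCT_TABLE)


cleaned = rm_punc_with_translate("Frankly, my dear, I don't give a damn")
print(cleaned)
```

`string.punctuation` only covers ASCII characters, so symbols such as the en dash (–) are left untouched by both approaches.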

In [None]:
# Remove numbers from text
def rm_number_from_text(text):
    text = re.sub('[0-9]+', '', text)
    return ' '.join(text.split())  # to rm `extra` white space


print(rm_number_from_text('You are 100times sexier than me'))
print(rm_number_from_text('If you thought yes then you are 10 times more delusional than me'))

In [None]:
# Remove stopwords from text
def rm_stopwords_from_text(text):
    _stopwords = stopwords.words('english')
    text = text.split()
    word_list = [word for word in text if word not in _stopwords]
    return ' '.join(word_list)


rm_stopwords_from_text("Love means never having to say you're sorry")

`clean_text` is the function used to apply all the `filters` for cleaning the `string` data i.e. the text here.

In [None]:
def clean_text(text):
    text = text.lower()
    text = rm_punc_from_text(text)
    text = rm_number_from_text(text)
    text = rm_stopwords_from_text(text)

    # there is an en dash (–) in many titles, so replacing it with an empty str
    # the en dash (–) differs from the normal hyphen (-) and is not in string.punctuation
    text = re.sub('–', '', text)
    text = ' '.join(text.split())  # removing `extra` white spaces

    return text


clean_text("Mrs. Robinson, you're trying to seduce me, aren't you?")

Cleaning the texts in `text` and `title` columns in our `df`.

In [None]:
df.text = df.text.apply(clean_text)
df.title = df.title.apply(clean_text)

df.title.values[:10].tolist()

In [None]:
# Getting articles length
def get_article_len(text):
    return len(text)


df['article_length'] = df.text.apply(get_article_len)
df.article_length.values[:10].tolist()

## Exploratory Data Analysis

> Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

For more info on `EDA` read the following posts: [Post_1](https://www.ibm.com/cloud/learn/exploratory-data-analysis) and [Post_2](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)

So let's explore the data.

![](https://media.giphy.com/media/l4KibOaou932EC7Dy/giphy.gif)

In [None]:
df.columns.tolist()

In [None]:
# Distribution of claps in our data
def display_histplot_for_claps(df, claps_threshold=2_000):
    claps_threshold_str = CustomFormatter.format_likes_number_to_str(claps_threshold)
    
    f, axs = plt.subplots(1, 2, figsize=(16, 4))

    sns.histplot(x=df.claps, kde=False, ax=axs[0])
    sns.histplot(x=df[df.claps <= claps_threshold].claps, kde=False, ax=axs[1])

    axs[0].set_xlabel('Distribution of all the claps')
    axs[1].set_xlabel(f'Distribution of claps (<= {claps_threshold_str})')

    # percentage of articles with claps less than or equal to claps_threshold
    pct_of_clap = round(len(df[df.claps <= claps_threshold]) / len(df) * 100, 2)

    print(f' {pct_of_clap}% of articles have less than or equal to {claps_threshold_str} 👏 claps')

display_histplot_for_claps(df)

The above `distribution plots` show that there are some outliers in the `claps` column.

In [None]:
sns.boxplot(x=df.claps) 

The `claps` greater than `15K` are the `outliers`, as they lie far outside the box that contains the other observations, i.e. nowhere near the `quartiles`.

To know more about `detecting and removing outliers` read the following [post](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba).
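
To get a feel for the two rules before applying them to `claps`, here is a small sketch on made-up numbers (the values are assumptions chosen for illustration). The z-score rule flags values more than 3 standard deviations from the mean, while the IQR rule flags values more than 1.5 × IQR outside the quartiles.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical clap counts: 19 ordinary values, one moderate and one extreme outlier
toy_claps = pd.Series(list(range(100, 2000, 100)) + [6_000, 50_000])

# z-score rule: flag values more than 3 standard deviations from the mean
z = np.abs(np.asarray(zscore(toy_claps)))
z_outliers = toy_claps[z > 3]

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles
q1, q3 = toy_claps.quantile(0.25), toy_claps.quantile(0.75)
iqr = q3 - q1
iqr_outliers = toy_claps[(toy_claps < q1 - 1.5 * iqr) | (toy_claps > q3 + 1.5 * iqr)]

print(z_outliers.tolist())
print(iqr_outliers.tolist())
```

In this toy series the extreme value inflates the standard deviation so much that the moderate outlier gets a small z-score, whereas the quartiles barely move. That is one reason the two methods can disagree on skewed data.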

In [None]:
# ### Removing outliers using Z score ###

# getting zscores of all the claps
claps_zscores = np.abs(zscore(df.claps))

# keeping the threshold of 3 (above which a clap will be an outlier)
# since the absolute value is taken, this one threshold covers both z > 3 and z < -3
clap_outliers_row_idx = np.where(claps_zscores > 3)[0].tolist()

# removing outliers
df.drop(clap_outliers_row_idx, axis='rows', inplace=True)

sns.boxplot(x=df.claps)

In [None]:
# ### Removing outliers using IQR ###

claps_q1 = df.claps.quantile(0.25)
claps_q3 = df.claps.quantile(0.75)
iqr = claps_q3 - claps_q1
print(f'IQR for claps: {iqr}')

clap_outliers_row_idx = df.claps[(df.claps < (claps_q1 - 1.5 * iqr)) | (df.claps > (claps_q3 + 1.5 * iqr))].index.tolist()

# removing outliers
df.drop(clap_outliers_row_idx, axis='rows', inplace=True)

sns.boxplot(x=df.claps)

In [None]:
# Helper functions to remove outliers


# Using IQR method
def rm_outliers_in_col_using_iqr(df, col, inplace=False):
    Q1 = col.quantile(0.25)
    Q3 = col.quantile(0.75)
    IQR = Q3 - Q1
    print(f'IQR: {IQR}')
    outliers_row_idx = col[(col < (Q1 - 1.5 * IQR)) | (col > (Q3 + 1.5 * IQR))].index.tolist()
    return df.drop(outliers_row_idx, axis='rows', inplace=inplace)


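# Using the Zscore method
def rm_outliers_in_col_using_zscore(df, col, inplace=False, threshold=3):
    zscores = np.abs(zscore(col))
    # np.where returns positions, but df.drop expects index labels,
    # so map the positions back to labels via col.index
    outliers_row_idx = col.index[np.where(zscores > threshold)[0]].tolist()
    return df.drop(outliers_row_idx, axis='rows', inplace=inplace)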

In [None]:
# removing remaining outliers 
for _ in range(10):
    rm_outliers_in_col_using_iqr(df, df.claps, inplace=True)

sns.boxplot(x=df.claps)
    
# the outliers in the claps column are removed multiple times, `maybe` because
# the majority of the claps are less than 3K and the outliers are spread very far out
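
Instead of a fixed number of passes, the IQR rule can be repeated until it stops flagging rows. The sketch below shows that idea on made-up data; the helper names, the `max_rounds` guard and the toy values are all assumptions.

```python
import pandas as pd


def iqr_mask(col):
    # True for values outside the 1.5 * IQR fences
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)


def rm_outliers_until_stable(df, col_name, max_rounds=50):
    rounds = 0
    while rounds < max_rounds:
        mask = iqr_mask(df[col_name])
        if not mask.any():
            break  # nothing left to remove
        df = df[~mask]
        rounds += 1
    return df, rounds


# Hypothetical data: 140 only becomes an outlier once 5000 is gone
toy_df = pd.DataFrame({'claps': [10, 20, 30, 40, 50, 60, 70, 80, 140, 5000]})
cleaned_df, rounds = rm_outliers_until_stable(toy_df, 'claps')

print(rounds)
print(cleaned_df.claps.tolist())
```

Each pass recomputes the quartiles on the remaining rows, so the fences tighten and values that were inside them earlier can fall outside later. On real data, repeated trimming can also remove legitimate observations, so it is worth checking how many rows each pass drops.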

In [None]:
# distribution of reading_time in our data
def display_histplot_for_reading_time(df):
    sns.histplot(
        x=df.reading_time, 
        # one bin per minute; `+ 2` so the largest value gets its own bin
        kde=False, bins=range(df.reading_time.max() + 2), 
        color='#e61e64', alpha=.5
    )
    
    avg_reading_time = round(df.reading_time.mean(), 2)
    print(f'The average reading ⏰ time of an article is {avg_reading_time}mins')


display_histplot_for_reading_time(df)

In [None]:
sns.boxplot(x=df.reading_time)

In [None]:
# removing outliers in reading_time column
rm_outliers_in_col_using_iqr(df, df.reading_time, inplace=True)
sns.boxplot(x=df.reading_time)

In [None]:
def display_claps_and_reading_time(df):
    f, axs = plt.subplots(1, 2, figsize=(16, 4))

    sns.scatterplot(
        x='claps', y='reading_time', hue='article_length', data=df, 
        palette='mako', s=80, ax=axs[0]
    )
    sns.histplot(
        x='claps', y='reading_time', data=df, 
        cmap='mako', ax=axs[1]  # `palette` needs `hue`; a 2D histogram takes `cmap`
    )


display_claps_and_reading_time(df)
    
# Articles whose reading_time is more than 12.5 mins tend not to get many claps

In [None]:
df[['claps', 'reading_time']].corr() # pearson corr == 0.28...

`claps` & `reading_time` have only a `weak positive correlation` (about 0.28), i.e. they are barely correlated.
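
`DataFrame.corr()` computes the Pearson coefficient by default, which only measures linear association. As a cross-check for skewed columns, the rank-based Spearman coefficient can be requested via `method='spearman'`. The sketch below uses made-up data to show how the two can differ; it says nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical data: claps grow monotonically with reading_time, but not linearly
toy = pd.DataFrame({'reading_time': range(1, 11)})
toy['claps'] = toy.reading_time ** 3

pearson = toy.corr(method='pearson').loc['claps', 'reading_time']
spearman = toy.corr(method='spearman').loc['claps', 'reading_time']

print(f'pearson: {pearson:.3f}, spearman: {spearman:.3f}')
```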

Color functions to use different colours for `wordcloud` text.

In [None]:
def wc_blue_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(214, 67%%, %d%%)" % randint(60, 100)

def wc_grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % randint(60, 100)

def wc_green_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(123, 34%%, %d%%)" % randint(50, 100)

def wc_red_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(23, 54%%, %d%%)" % randint(50, 100)
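
The four colour functions differ only in hue, saturation and lightness range, so in the spirit of `DRY` they could be produced by one factory. This is a sketch; `make_wc_color_func` is a made-up helper, not part of `wordcloud`.

```python
from random import randint


def make_wc_color_func(hue, saturation, min_lightness=60, max_lightness=100):
    # returns a function with the signature that WordCloud.recolor expects
    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        lightness = randint(min_lightness, max_lightness)
        return f'hsl({hue}, {saturation}%, {lightness}%)'
    return color_func


# Same hue and saturation as wc_blue_color_func above
wc_blue_color_func_alt = make_wc_color_func(214, 67)
sample_color = wc_blue_color_func_alt('python', 12, (0, 0), None)
print(sample_color)
```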

**Plotting wordclouds**

In [None]:
# stopwords for wordcloud
def get_wc_stopwords():
    wc_stopwords = set(STOPWORDS)

    # Adding words to stopwords 
    # these words showed up while plotting wordcloud for text
    wc_stopwords.add('s')
    wc_stopwords.add('one')
    wc_stopwords.add('using')
    wc_stopwords.add('example')
    wc_stopwords.add('work')
    wc_stopwords.add('use')
    wc_stopwords.add('make')
    
    return wc_stopwords


# get title mega str (combined str of all titles)
def get_title_combined_str(df):
    title_words = []
    for title in df.title.values:
        title_words.extend(title.split())
    return ' '.join(title_words)


# get text mega str (combined str of all text)
def get_text_combined_str(df):
    text_words = []
    for text in df.text.values:
        text_words.extend(text.split())
    return ' '.join(text_words)


# plot wordcloud
def plot_wordcloud_for_title_and_text(title_wc, text_wc, title_color_func, text_color_func):
    f, axs = plt.subplots(1, 2, figsize=(20, 10))
    
    with sns.axes_style("ticks"):
        sns.despine(offset=10, trim=True)

        if not title_color_func:
            # default color
            axs[0].imshow(title_wc, interpolation="bilinear")
            axs[0].set_xlabel('Title WordCloud')
        else:
            # customized color
            axs[0].imshow(title_wc.recolor(color_func=title_color_func, random_state=0), interpolation="bilinear")
            axs[0].set_xlabel('Title WordCloud')
            
        if not text_color_func:
            axs[1].imshow(text_wc, interpolation="bilinear")
            axs[1].set_xlabel('Text WordCloud')
        else:
            axs[1].imshow(text_wc.recolor(color_func=text_color_func, random_state=0), interpolation="bilinear")
            axs[1].set_xlabel('Text WordCloud')

            
# display wordcloud
def wordcloud_for_title_and_text(df, title_color_func=None, text_color_func=None):
    # This str will be used to create wordclouds for title & text
    title_str = get_title_combined_str(df)
    text_str = get_text_combined_str(df)
        
    wc_stopwords = get_wc_stopwords()

    title_wc = WordCloud(stopwords=wc_stopwords, width=800, height=400, random_state=0).generate(title_str)
    text_wc = WordCloud(stopwords=wc_stopwords, width=800, height=400, random_state=0).generate(text_str)
    
    plot_wordcloud_for_title_and_text(title_wc, text_wc, title_color_func, text_color_func)
        
        
wordcloud_for_title_and_text(df, wc_blue_color_func, wc_grey_color_func)

The `WordInfo` class will help us `encapsulate` info about a `word` and will contain helper functions to work with the `text` & `title` columns. Basically, the `WordInfo` class acts as a `tokenizer`, slightly customized to my needs.

In [None]:
class WordInfo:
    def __init__(self, word, domain, reading_time):
        self.word = word
        self.count = 1
        self.reading_time = reading_time
        
        self.domains = set()  # domains in which it appeared
        self.domains.add(domain)

        
    def increment(self, domain, reading_time):
        self.count += 1
        self.domains.add(domain)
        self.reading_time += reading_time
        
        
    def info(self):
        print(f'Word: {self.word}')
        print(f'Count: {self.count}')
        print(f'Domains: {list(self.domains)}')
        print(f'Reading time: {self.reading_time}mins')
        
        
    @staticmethod
    def exists(word, dictionary):
        return dictionary[word] if word in dictionary.keys() else False
    
    
    @staticmethod
    def increment_or_create(dictionary, word, domain, reading_time):
        if word not in stopwords.words('english'):
            obj = WordInfo.exists(word, dictionary)
            if not obj:
                dictionary[word] = WordInfo(word, domain, reading_time)
            else:
                obj.increment(domain, reading_time)
                
                
    @staticmethod
    def export_count_dict(word_dict):
        _dict = {}
        for wordinfo in list(word_dict.values()):
            _dict[wordinfo.word] = wordinfo.count
        return _dict
    
    
    @staticmethod
    def sort_dict_using_values(_dict):
        # returns new sorted lists (the dict itself is not modified)
        words = np.array(list(_dict.keys()))
        counts = np.array(list(_dict.values()))
        
        sorted_idxs = counts.argsort()
        sorted_counts = counts[sorted_idxs]
        new_words_order = words[sorted_idxs]

        # reversing the lists (from ascending to descending)
        _counts = list(reversed(sorted_counts))
        _words = list(reversed(new_words_order))

        return (_counts, _words)
    
    
    @classmethod
    def word_count_df(cls, _dict):
        word_count_dict = cls.export_count_dict(_dict)
        word_count_sorted = cls.sort_dict_using_values(word_count_dict)

        word_count_df = pd.DataFrame({
            'words': word_count_sorted[1],
            'counts': word_count_sorted[0]
        })

        return word_count_df

Below is an example of how `WordInfo` class will be used to make our `EDA` easy.

In [None]:
# key - words: str
# value - object: WordInfo
WORD_DICT = {}


# To test/see how our WORD_DICT will look 
for word in ['hello', 'world', 'python', 'python', 'tensorflow']:
    WordInfo.increment_or_create(WORD_DICT, word, 'deeplearning.io', 24)
        
print(WORD_DICT)

for obj in WORD_DICT.values():
    print()
    obj.info()

Extracting information about `words` in `title` and `text` columns in `df`.

In [None]:
def get_title_and_text_word_dict(df):
    title_word_dict = {}
    text_word_dict = {}
    
    for domain, title, text, reading_time in df[['domain', 'title', 'text', 'reading_time']].values:
        for word_in_title in title.split():
            WordInfo.increment_or_create(title_word_dict, word_in_title, domain, reading_time)
        for word_in_text in text.split():
            WordInfo.increment_or_create(text_word_dict, word_in_text, domain, reading_time)
            
    return (title_word_dict, text_word_dict)


title_word_dict, text_word_dict = get_title_and_text_word_dict(df)

title_word_dict['medium'].info()
print()
text_word_dict['medium'].info()
print()
title_word_dict['neural'].info()
print()
text_word_dict['neural'].info()

In [None]:
title_word_count_df = WordInfo.word_count_df(title_word_dict)
text_word_count_df = WordInfo.word_count_df(text_word_dict)

In [None]:
def display_word_count(df, top=5, bottom=5):
    # df here is word_count_df
    
    f, axs = plt.subplots(1, 2, figsize=(16, 4))

    # most used words
    sns.barplot(
        x=df.head(top).words, y=df.head(top).counts, 
        color='#473991', alpha=.9, ax=axs[0]
    )

    # least used words
    sns.barplot(
        x=df.tail(bottom).words, y=df.tail(bottom).counts,
        color='#399188', alpha=.9, ax=axs[1]
    )

    axs[0].set_xlabel('Words')
    axs[0].set_ylabel('Counts')  
    axs[1].set_xlabel('Words')
    axs[1].set_ylabel('Counts')  

In [None]:
display_word_count(title_word_count_df)

In [None]:
display_word_count(text_word_count_df)

In [None]:
# top 100 articles with respect to claps
top_atricles_wrt_claps = df.sort_values(by='claps', ascending=False).iloc[:100]
top_atricles_wrt_claps.sample(5)

In [None]:
wordcloud_for_title_and_text(top_atricles_wrt_claps, wc_green_color_func, wc_red_color_func)

The most-clapped titles & articles include AI topics.

In [None]:
def get_words_count(text):
    info = {} # {word: count}
    for word in text.split():
        if word in info.keys():
            info[word] += 1
        else:
            info[word] = 1
    return info
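
The standard library's `collections.Counter` does the same counting. A minimal sketch follows; the function name and sample sentence are made up.

```python
from collections import Counter


def get_words_count_with_counter(text):
    # Counter counts each whitespace-separated word; dict() gives a plain {word: count}
    return dict(Counter(text.split()))


counts = get_words_count_with_counter('deep learning makes deep networks learn')
print(counts)
```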

The `AuthorInfo` class will encapsulate different pieces of information about all the `authors`.

In [None]:
class AuthorInfo:
    # this will contain author info 
    authors_df = pd.DataFrame({
        'name': [],
        'total_claps': [],
        'avg_claps': [],
        'total_reading_time': [],
        'avg_reading_time': []
    })
    
    # this will contain author name & domains
    domains_df = pd.DataFrame({
        'authors': [],
        'domains': []
    })
    
    # this will contain words used by authors & their count i.e. how much
    words_df = pd.DataFrame({
        'authors': [],
        'words': [],
        'counts': [],
        'where': []     # title or text (where is the word used)
    })
    
    
    def __init__(self, author_name, author_df):
        # add author info
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        AuthorInfo.authors_df = pd.concat([AuthorInfo.authors_df, pd.DataFrame([{
            'name': author_name,
            'total_claps': author_df.claps.sum(),
            'avg_claps': author_df.claps.mean(),
            'total_reading_time': author_df.reading_time.sum(),
            'avg_reading_time': author_df.reading_time.mean(),
        }])], ignore_index=True)
        
        # add author domains
        for domain in author_df.domain.values:
            # pd.concat instead of DataFrame.append (removed in pandas 2.0)
            AuthorInfo.domains_df = pd.concat([AuthorInfo.domains_df, pd.DataFrame([{
                'authors': author_name,
                'domains': domain
            }])], ignore_index=True)
            
        # add word count
        for title, text in author_df[['title', 'text']].values:
            title_info = get_words_count(title)
            text_info = get_words_count(text)
            AuthorInfo.add_wordcount_using_dict(title_info, author_name, 'title')
            AuthorInfo.add_wordcount_using_dict(text_info, author_name, 'text')            
        
        
    @classmethod
    def add_wordcount_using_dict(cls, _dict, author_name, where):
        for word, count in _dict.items(): 
            # pd.concat instead of DataFrame.append (removed in pandas 2.0)
            cls.words_df = pd.concat([cls.words_df, pd.DataFrame([{
                'authors': author_name,
                'words': word,
                'counts': count,
                'where': where
            }])], ignore_index=True)
            
            
    @classmethod
    def get_domains_using_author_name(cls, author_name):
        return AuthorInfo.domains_df[AuthorInfo.domains_df.authors == author_name].domains.unique().tolist()
    
    
    @classmethod
    def get_wordcount_df(cls, author_name, where, ascending=False):
        return cls.words_df[
            # using ['where'] since where is a method of pd.Series
            (cls.words_df.authors == author_name) & (cls.words_df['where'] == where)
        ].sort_values(by='counts', ascending=ascending)
    

    @classmethod
    def reset_df(cls):
        cls.authors_df = pd.DataFrame({
            'name': [],
            'total_claps': [],
            'avg_claps': [],
            'total_reading_time': [],
            'avg_reading_time': []
        })

        cls.domains_df = pd.DataFrame({
            'authors': [],
            'domains': []
        })

        cls.words_df = pd.DataFrame({
            'authors': [],
            'words': [],
            'counts': [],
            'where': []
        })

In [None]:
for author, author_df in top_atricles_wrt_claps.groupby(by='author'):
    AuthorInfo(author, author_df)
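
The per-author totals and averages could also be computed in one step with `groupby` and named aggregation (available since pandas 0.25). The sketch below uses a made-up three-row frame; its column names mirror the notebook's, but the values are assumptions.

```python
import pandas as pd

# Hypothetical articles, not rows from the dataset
toy_articles = pd.DataFrame({
    'author': ['A', 'A', 'B'],
    'claps': [100, 300, 50],
    'reading_time': [5, 7, 10],
})

authors_summary = (
    toy_articles.groupby('author')
    .agg(
        total_claps=('claps', 'sum'),
        avg_claps=('claps', 'mean'),
        total_reading_time=('reading_time', 'sum'),
        avg_reading_time=('reading_time', 'mean'),
    )
    .reset_index()
    .rename(columns={'author': 'name'})
)

print(authors_summary)
```

This builds the whole summary at once rather than growing a DataFrame row by row, which is usually faster on larger inputs.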

In [None]:
AuthorInfo.domains_df.head()

In [None]:
AuthorInfo.words_df.head()

In [None]:
AuthorInfo.get_wordcount_df('Adam Geitgey', 'title').head(10)

Each row in `words_df` of `AuthorInfo` records a word that appeared in a single `title (or text)`, and the `counts` column holds the number of times the word appeared in that title (or text). Because of that, there might be duplicate words in the `words` column.

But since the counts of some `duplicates` are identical, it hints that there might be some duplicate rows in `df`.

In [None]:
# no duplicates
print(f'Number of duplicate rows: {len(df[df.duplicated()])}')

# checking duplication in author name, title text
print(f"Number of duplicate rows: {len(df[df[['author', 'title', 'text']].duplicated()])}")

# checking where these duplicates differentiate from each other
print(f"Number of duplicate rows: {len(df[df[['author', 'title', 'text', 'claps']].duplicated()])}")
print(f"Number of duplicate rows: {len(df[df[['author', 'title', 'text', 'reading_time']].duplicated()])}")
print(f"Number of duplicate rows: {len(df[df[['author', 'title', 'text', 'link']].duplicated()])}")

# so `link` is the column that differentiates the duplicates
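
By default `duplicated` marks every copy except the first. Passing `keep=False` marks all copies, which makes it easier to compare them side by side. A small sketch on made-up rows (the values and links are assumptions):

```python
import pandas as pd

# Hypothetical rows: the first two differ only in their link
toy = pd.DataFrame({
    'author': ['A', 'A', 'B'],
    'title': ['t1', 't1', 't2'],
    'text': ['x', 'x', 'y'],
    'link': ['https://a.example/1', 'https://b.example/1', 'https://c.example/2'],
})

subset = ['author', 'title', 'text']

# default keep='first': only the later copies are flagged
later_copies = toy[toy.duplicated(subset)]

# keep=False: every row that has a duplicate is flagged
all_copies = toy[toy.duplicated(subset, keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```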

In [None]:
# duplicate rows

print(f"Number of duplicate titles: {len(df[df[['title']].duplicated()])}")
print(f"Number of duplicate texts: {len(df[df[['text']].duplicated()])}")

def get_duplicate_dfs(df, group_by, how_many=1):
    dfs = []
    
    # considering duplicates on the basis of title & text columns & then grouping them by author
    author_grp = df[df.duplicated(['title', 'text'])].groupby(by=group_by)
    
    for idx, (author, author_df) in enumerate(author_grp):
        if idx >= how_many:
            break
        dfs.append(author_df)
    
    # always return a list, even when there are fewer groups than `how_many`
    return dfs
    
    
# the `duplicated` method on df by default returns all the duplicates `except the first` 
duplicate_sample_df = get_duplicate_dfs(df, group_by='author', how_many=5)

In [None]:
def print_links(df):
    for link in df.link.values.tolist():
        print(link)

In [None]:
print_links(duplicate_sample_df[0])
duplicate_sample_df[0]

In [None]:
print_links(duplicate_sample_df[1])
duplicate_sample_df[1]

Dropping all the `duplicates` except for the first occurrence. Since the `link` column has unique values even for the duplicates, the duplicate rows are removed on the basis of `author`, `claps`, `title` & `text`.

In [None]:
df.drop_duplicates(['author', 'claps', 'title', 'text'], ignore_index=True, inplace=True)
len(df) # remaining rows

After this `catastrophic` event we can `re-run all of the analysis` to correct any `misinterpretation` caused by these duplicate rows.

It's going to be very easy to re-run all the `analysis` as we have followed the `DRY` principle of programming.

In [None]:
wordcloud_for_title_and_text(df, wc_blue_color_func, wc_grey_color_func)

In [None]:
title_word_dict, text_word_dict = get_title_and_text_word_dict(df)

In [None]:
title_word_count_df = WordInfo.word_count_df(title_word_dict)
text_word_count_df = WordInfo.word_count_df(text_word_dict)

In [None]:
display_word_count(title_word_count_df)

In [None]:
display_word_count(text_word_count_df)

In [None]:
df[['claps', 'reading_time']].corr() 

The correlation increased from `0.28` to `0.32`, but `claps` and `reading_time` still have only a `low positive correlation`.

In [None]:
display_histplot_for_reading_time(df)

**Top 100 articles with respect to claps**

In [None]:
top_atricles_wrt_claps = df.sort_values(by='claps', ascending=False).iloc[:100]
top_atricles_wrt_claps.sample(5)
wordcloud_for_title_and_text(top_atricles_wrt_claps, wc_green_color_func, wc_red_color_func)

`Resetting` the author info using the data with `no duplicates`.

In [None]:
AuthorInfo.reset_df()

for author, author_df in top_atricles_wrt_claps.groupby(by='author'):
    AuthorInfo(author, author_df)

In [None]:
def display_avg_claps_and_avg_reading_time(df):
    f, axs = plt.subplots(1, 2, figsize=(16, 4))

    # `palette` only applies together with `hue`, so it is dropped here
    sns.scatterplot(
        x='avg_claps', y='avg_reading_time', data=df, 
        s=80, ax=axs[0]
    )
    sns.histplot(
        x='avg_claps', y='avg_reading_time', data=df, 
        cmap='mako', ax=axs[1]
    )


display_avg_claps_and_avg_reading_time(AuthorInfo.authors_df)
    
# Articles whose reading_time is more than 12 mins tend not to get many claps

In [None]:
def get_top_words(author_name, words_df, where, top_words):
    # filter on the `words_df` that was passed in, not on the class attribute
    rows = words_df[
        (words_df.authors == author_name) & (words_df['where'] == where)
    ].sort_values(by='counts', ascending=False).iloc[:top_words].values.tolist()
    
    data = {}
    for _, word, count, _ in rows:
        # the same word can appear in several titles/texts, so sum its counts
        data[word] = data.get(word, 0) + count
            
    return data

    
def get_top_authors_info(authors_df, sort_by, top=5, top_words=5):
    top_author_df = authors_df.sort_values(by=sort_by, ascending=False).iloc[:top]
    df = top_author_df[['name', 'total_claps', 'total_reading_time']]
    
    for author_name, total_claps, total_reading_time in df.values:
        print(f'Author name: {author_name}')
        print(f'Total claps: {total_claps}')
        print(f'Total reading time: {total_reading_time}')
    
        top_words_in_title = get_top_words(author_name, AuthorInfo.words_df, 'title', top_words)
        top_words_in_text = get_top_words(author_name, AuthorInfo.words_df, 'text', top_words)

        print(f'Top words used in title:')
        for word, count  in top_words_in_title.items():
            print(f'\t{word} => {int(count)}x')
        print(f'Top words used in text:')
        for word, count in top_words_in_text.items():
            print(f'\t{word} => {int(count)}x')
    
        print()

**Top 5 authors info with respect to total claps**

In [None]:
get_top_authors_info(AuthorInfo.authors_df, 'total_claps')

**Top 5 authors info with respect to total reading_time**

In [None]:
get_top_authors_info(AuthorInfo.authors_df, 'total_reading_time')

---

I'll wrap things up there. If you want to find some other answers then go ahead and `edit` this kernel. If you have any `questions` then do let me know.

If this kernel helped you then don't forget to 🔼 `upvote` and share your 🎙 `feedback` on improvements of the kernel.

![](https://media.giphy.com/media/iFU36VwXUd2O43gdcr/giphy.gif)

---