## This Jupyter notebook contains an analysis of public blog posts. There are about 1.8 million posts in this data set, all of them publicly available on the web. The data has the following format:

{
    “blog_id”: id identifying the blog
    
    “post_id”: id identifying the post on that blog
    
    “lang”: the language code for the post (this is a combination of the user setting and our language detection algorithm)
    
    “url”: url to the post (but without ‘http(s)://’)
    
    “date_gmt”: date and time that the post was published (as set by the user), for example “2010-01-30 14:48:55”
    
    “title”: text title of the post
    
    “content”: text of the post with all html stripped out
    
    “author”: author's display name
    
    “author_login”: author's login username
    
    “author_id”: user id of the author
    
    “liker_ids”: array of user ids for people who have liked this post
    
    “like_count”: number of likes for this post
    
    “commenter_ids”: array of user ids for people who have commented on this post
    
    “comment_count”: number of comments on this post
}
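
As a rough illustration of this record layout, the sketch below parses one made-up JSON line and reads a few fields. All sample values (ids, URL, title, author) are invented for the example and do not come from the data set, and the consistency check between the count fields and the id arrays is an assumption about the data, not something the format description guarantees.

```python
import json

# One hypothetical record in the format described above (all values are invented)
sample_line = (
    '{"blog_id": 1, "post_id": 10, "lang": "en", '
    '"url": "example.wordpress.com/2010/01/30/hello", '
    '"date_gmt": "2010-01-30 14:48:55", "title": "Hello", '
    '"content": "First post", "author": "Jane", "author_login": "jane", '
    '"author_id": 7, "liker_ids": [3, 4], "like_count": 2, '
    '"commenter_ids": [], "comment_count": 0}'
)

record = json.loads(sample_line)

# Assumption: the count fields equal the length of the matching id arrays
likes_consistent = record["like_count"] == len(record["liker_ids"])
comments_consistent = record["comment_count"] == len(record["commenter_ids"])
print(record["title"], likes_consistent, comments_consistent)
```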

## Tableau visualization link: 
https://public.tableau.com/app/profile/himani.gadve/viz/Automattic_userengagmentanalysis/Summary


In [2]:
# importing all the required packages
import os
import json
import gzip

from urllib.request import urlopen

# dataframe and series
import pandas as pd
import numpy as np
from datetime import datetime, timedelta, date

# plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# silence warnings for cleaner notebook output
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
%matplotlib inline

In [3]:
#%%
# Show each warning only once (this takes precedence over the blanket 'ignore' filter set in the import cell)
warnings.filterwarnings(action='once') 
pd.set_option('display.float_format', lambda x: '%.3f' % x) #display changed from scientific to numeric format
pd.set_option('expand_frame_repr', False)

## JSON to CSV Conversion 

In [4]:
# getting data from json.gz file
#%%
df_data = []
with gzip.open('posts.jsonl.gz') as data:
    for i in data:
        df_data.append(json.loads(i.strip()))
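
The loading step above can be tried on a small scale with the following sketch. It writes a tiny gzipped JSON Lines file to a temporary directory and reads it back line by line; the file name `sample.jsonl.gz` and the two records are made up for the example.

```python
import gzip
import json
import os
import tempfile

# Build a tiny gzipped JSON Lines file so the example is self-contained
rows = [{"post_id": 1, "like_count": 0}, {"post_id": 2, "like_count": 3}]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

# Read it back one line at a time; "rt" mode yields decoded text lines
loaded = []
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        line = line.strip()
        if line:  # skip blank lines defensively
            loaded.append(json.loads(line))

print(len(loaded))
```

Reading line by line keeps only one raw line in memory at a time, although the resulting list of dictionaries still has to fit in memory.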



In [5]:
#%%
# number of records loaded, i.e. the total number of posts
print(len(df_data))

# to see the first row of the list
print(df_data[0])

1809199
{'comment_count': 0, 'content': "The Snap! Jamaal Jackson tore knee ligaments in their win over the Denver Broncos. Making his first NFL start at center will be Nick Cole. Cole filled in at center against the Broncos, but McNabb fumbled two snaps, and three offensive linemen were penalized for false starts. While you can&#39;t blame all the false starts on the new center, it certainly doesn&#39;t make things easier for the linemen. \xa0 The good news is that Brian Westbrook is back after missing 5 games. His performance last week against the Broncos was promising, he had 9 carries for a total of 32 yards. In his last game against Dallas, Westbrook carried the ball 13 times for 50 yards rushing. \xa0 Does it matter that Michael Vick may not be available to play? Probably not, but we&#39;d certainly like to see him out on the field for a few key plays.\xa0If McNabb gets hurt, Kevin Kolb\xa0who proved himself against the Chiefs\xa0can help the Eagles soar to victory. \xa0 Bring ou

In [6]:
#%%
df = pd.DataFrame(df_data) # convert the list of dictionaries to a dataframe

In [7]:
#%%
df.to_csv('blog_post.csv', index = False) # write to CSV once so later runs can load it directly

In [8]:
#%%
pd.options.display.max_columns=100 # To see the hidden columns in dataframe

# Exploratory Data Analysis
Let’s have a look at data dimensionality, feature names, and feature types.

In [1]:
#%%
# import repeated here so the cell also runs in a fresh kernel
import pandas as pd
df = pd.read_csv('blog_post.csv', parse_dates=['date_gmt'], low_memory=False) #getting main data
df.head()

In [None]:
#%%
# From the output, we can see that the table contains 1973758 rows and 14 columns.
df.shape

In [None]:
#%%
df.isna().sum()

### Checking for missing values in key dimensions: 'blog_id','author_id','post_id'

In [None]:
#%%
# new data frame without the rows where all three key columns are NA
df1 = df.dropna(axis = 0, subset = ['blog_id','author_id','post_id'], how = 'all')
  
# comparing sizes of data frames
print("Old data frame length:", len(df), "\nNew data frame length:", 
       len(df1), "\nNumber of rows with all key columns NA: ",
       (len(df)-len(df1)))
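
The `how` argument decides which rows `dropna` removes. A small sketch on made-up data (the three rows below are invented) shows the difference between `how='all'` and `how='any'`:

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 misses one key, row 2 misses every key
toy = pd.DataFrame({
    "blog_id":   [1, np.nan, np.nan],
    "author_id": [5, 6, np.nan],
    "post_id":   [9, 8, np.nan],
})
keys = ["blog_id", "author_id", "post_id"]

kept_all = toy.dropna(subset=keys, how="all")  # drops only rows where every key is NA
kept_any = toy.dropna(subset=keys, how="any")  # drops rows with at least one NA key
print(len(kept_all), len(kept_any))
```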

In [None]:
#%%
# dropping the records where 'blog_id', 'author_id' and 'post_id' are all null
df.dropna(axis = 0, subset = ['blog_id','author_id','post_id'], how = 'all',inplace = True)

In [None]:
#%%
# converting date from string to datetime; values that are not in the expected format become NaT
df['date_gmt'] = pd.to_datetime(df['date_gmt'].astype(str).str[:10], errors = 'coerce')
# drop null dates from analysis
df.dropna(axis = 0, subset = ['date_gmt'], how = 'all',inplace = True)
print(df['date_gmt'].isnull().sum())
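
To make the coercion step concrete, here is a minimal sketch on three invented strings. It assumes the dates follow the `YYYY-MM-DD HH:MM:SS` pattern shown in the format description, so an explicit `format` is passed; anything that does not match becomes `NaT` and can then be dropped.

```python
import pandas as pd

# One valid timestamp and two values that cannot be parsed as dates
raw = pd.Series(["2010-01-30 14:48:55", "0000-00-00 00:00:00", "not a date"])

# Keep the first 10 characters (the date part) and coerce bad values to NaT
parsed = pd.to_datetime(raw.str[:10], format="%Y-%m-%d", errors="coerce")
clean = parsed.dropna()
print(parsed.isna().sum(), len(clean))
```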

In [None]:
#%%
# coerce comment_count to a numeric (float) column; unparseable values become NaN
df['comment_count'] = pd.to_numeric(df['comment_count'], errors='coerce', downcast='float')

### After handling missing values, the data looks good for the key dimensions

In [None]:
#%%
df.isnull().sum()

In [None]:
#%%
df.describe()

## What are the median and the mean numbers of likes per post in this data sample?

In [None]:
#%%
df_median=df.groupby(['post_id']).agg({'like_count':'median'}).reset_index()
df_median

In [None]:
#%%
df_mean=df.groupby(['post_id']).agg({'like_count':'mean'}).reset_index()
df_mean

### Findings:

Comparing the median and mean like counts for the first five post ids, the median is zero while the mean ranges from about 1 to 6 likes per post.
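
Since each row of the data is one post, the mean and median number of likes per post can also be taken directly over the `like_count` column without grouping. The sketch below uses five invented values to show how a few well-liked posts pull the mean above the median:

```python
import pandas as pd

# Toy like counts: most posts have no likes, one post has many
toy = pd.DataFrame({"like_count": [0, 0, 0, 1, 9]})

mean_likes = toy["like_count"].mean()
median_likes = toy["like_count"].median()
print(mean_likes, median_likes)
```

A median of zero next to a positive mean is the pattern one would expect from a heavily right-skewed distribution of likes.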

## Visualize the total daily count of likes vs total daily comments for this data sample

To better understand the daily count of likes vs comments, we first look at the time series across all years, splitting the analysis into three phases:
- Yearly Trends
- Monthly Trends
- Daily Trends based on the cohort filtering

In [None]:
#%%
# converting object to datetime
df['date_gmt'] =  pd.to_datetime(df['date_gmt'])

In [None]:
#%%
# total daily count of likes vs total daily comments 
df1 = df.groupby(df.date_gmt.dt.date).agg({'like_count':'sum','comment_count':'sum'}).reset_index()
df1.sort_values('date_gmt',ascending = False).head(10)

## Yearly trend Analysis

In [None]:
#%%
df['Year_gmt'] = pd.to_datetime(df['date_gmt']).dt.year.astype(str)  # year as a string label
yearlydf = df.groupby('Year_gmt').agg({'like_count':'sum','comment_count':'sum'}).reset_index().sort_values('Year_gmt',ascending= True)
yearlydf['pct_like'] = yearlydf['like_count'].pct_change() * 100
yearlydf['pct_comment'] = yearlydf['comment_count'].pct_change() * 100
yearlydf.sort_values('Year_gmt',ascending= False)



**Findings:**

The yearly trend shows that like counts started growing in 2005 and rose sharply over the years, an increase of about 984k % compared with 2005. Comments started growing in 2002 and increased rapidly in the years that followed.
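
For reference, `pct_change` gives the change relative to the previous row, whereas a figure "compared with 2005" is a change relative to a fixed base year. A minimal sketch with invented yearly totals shows both; the column name `pct_vs_first` is made up for this example.

```python
import pandas as pd

# Invented yearly totals, already sorted by year
yearly = pd.DataFrame({
    "Year_gmt": ["2010", "2011", "2012"],
    "like_count": [100, 150, 300],
})

# Year-over-year change in percent (first row has no previous year, so NaN)
yearly["pct_like"] = yearly["like_count"].pct_change() * 100

# Change in percent relative to the first year in the table
base = yearly["like_count"].iloc[0]
yearly["pct_vs_first"] = (yearly["like_count"] / base - 1) * 100
print(yearly)
```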

In [None]:
#%%

#plotting totals 
plt.figure(figsize=(20,6))
ax1 = plt.subplot(111)
# no hue is used, so the line colour is set with `color` rather than `palette`
sns.lineplot(data=yearlydf, x='Year_gmt', y='like_count', color = 'blue', label ='like_count')
# total comments
sns.lineplot(data=yearlydf, x='Year_gmt', y='comment_count', color = 'orange',label = 'comment_count')
plt.xticks(rotation = 75, horizontalalignment='center', fontweight='light', fontsize=10)
plt.title("YEAR: like count vs comment count comparison")
plt.legend(loc='upper left')
plt.gcf().autofmt_xdate()
plt.show()

### Keeping only the years after 2002, as comments started growing around 2002

In [None]:
#%%
# Keep only the years after 2002
yearlydf1 = yearlydf[yearlydf['Year_gmt'] > '2002'].copy()

#plotting totals 
plt.figure(figsize=(20,4))
ax1 = plt.subplot(111)
# no hue is used, so the line colour is set with `color` rather than `palette`
sns.lineplot(data=yearlydf1, x='Year_gmt', y='like_count', color = 'blue', label ='like_count')
# total comments
sns.lineplot(data=yearlydf1, x='Year_gmt', y='comment_count', color = 'orange',label = 'comment_count')
plt.xticks(rotation = 75, horizontalalignment='center', fontweight='light', fontsize=10)
plt.title("YEAR: like count vs comment count comparison")
plt.legend(loc='upper left')
plt.gcf().autofmt_xdate()
plt.show()

In [None]:
#%%
matplotlib.rc_file_defaults()
ax1 = sns.set_style(style=None, rc=None )

fig, ax1 = plt.subplots(figsize=(15,3))
sns.barplot(data = yearlydf1, x='Year_gmt', y='comment_count', alpha=0.5)

In [None]:
#%%
yearlydf2 = yearlydf[yearlydf['Year_gmt'] > '2009'].copy()

matplotlib.rc_file_defaults()
ax1 = sns.set_style(style=None, rc=None )

fig, ax1 = plt.subplots(figsize=(15,3))
sns.barplot(data = yearlydf2, x='Year_gmt', y='like_count', alpha=0.5)

###  Findings: 
Like counts are clearly highest in the years 2013, 2014 and 2015.

## Daily trend Analysis

In [None]:
#%%
import datetime
dfm = df[df.date_gmt.dt.date > datetime.date(2010,12,1)].copy()
df2 = dfm.groupby(dfm.date_gmt.dt.date).agg({'like_count':'sum','comment_count':'sum'}).reset_index()
#plotting totals 
plt.figure(figsize=(22,8))
ax1 = plt.subplot(111)
sns.lineplot(data=df2, x='date_gmt', y='like_count', color = 'blue', label ='Daily like_count')
plt.xticks(rotation = 75, horizontalalignment='center', fontweight='light', fontsize=10)
plt.title("Daily: total likes", fontsize=20)
plt.legend(loc='upper right')
plt.gcf().autofmt_xdate()
plt.show()

### Findings:
There appears to be a sudden spike in the daily like counts at the end of 2014.

## Monthly trend Analysis


In [None]:
#%%
# zero-padded labels (e.g. 2010-02) so that sorting and comparing the strings is chronological
df['YearMonth'] = pd.to_datetime(df['date_gmt']).dt.strftime('%Y-%m')
monthdf = df.groupby('YearMonth').agg({'like_count':'sum','comment_count':'sum'}).reset_index().sort_values('YearMonth',ascending= True)
monthdf['pct_like'] = monthdf['like_count'].pct_change() * 100
monthdf['pct_comment'] = monthdf['comment_count'].pct_change() * 100
monthdf.sort_values('YearMonth',ascending= False)
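
Because the monthly table is sorted and filtered on a string label, the way the month is written matters. The sketch below, on three invented dates, compares unpadded labels such as `2010-2` with zero-padded ones such as `2010-02`; only the padded form sorts in calendar order.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2010-02-01", "2010-10-01", "2010-01-01"]))

# Unpadded month numbers, e.g. "2010-2"
unpadded = dates.apply(lambda d: f"{d.year}-{d.month}")
# Zero-padded month numbers, e.g. "2010-02"
padded = dates.dt.strftime("%Y-%m")

print(sorted(unpadded))  # ['2010-1', '2010-10', '2010-2']
print(sorted(padded))    # ['2010-01', '2010-02', '2010-10']
```

The same ordering issue affects `pct_change`, which compares each row with the one before it in the sorted table.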

In [None]:
#%%
monthdf1 = monthdf[monthdf['YearMonth'] > '2006-01'].copy()
monthdf1.sort_values('YearMonth',inplace=True)
monthdf1.set_index('YearMonth')['comment_count'].plot(kind='bar',figsize=(30, 10),color='cadetblue',rot=90, fontsize=12)
plt.title("Total Monthly comment count", y=1.013, fontsize=20)
plt.ylabel("Sum [comment count]", labelpad=10)
plt.xlabel("Date [Month - Year]", labelpad=10);

### Findings:
The chart shows a spike in comment counts in the final months of each year (September through December).

# Q3: How would you determine which authors are the most “popular”? What additional data would you need?

 The first step is to check whether any authors are null and to understand the total cohort size.

In [None]:
#%%
# check nulls in the author ids and count distinct authors
print(f'Rows with an author_id: {df["author_id"].count()}')
print(f'Unique authors: {df["author_id"].nunique()}')
print(f'Null Authors: {df["author_id"].isnull().sum()}')

To understand the popularity of an author, the first step is to aggregate the data on the following key dimensions:
- Count of post_id: how many posts the author has made
- Count of blog_id: how many blogs the author has published
- Sum of comment count: a good measure of popularity, as readers showed enough interest in the author's posts to comment on them
- Sum of like count: shows how well the author's posts were liked

In [None]:
#%%
authdf =  df.groupby(["author_id","author"]).agg({"post_id":'count',"blog_id":'nunique',"comment_count":'sum',"like_count":'sum'}).reset_index()
authdf.sort_values(by = ['comment_count',"like_count",'post_id'],ascending=False).head(10)
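
When aggregating per author, it helps to keep in mind that `'count'` tallies non-null rows while `'nunique'` counts distinct values. A small sketch on invented ids illustrates the difference; the output column names (`posts`, `blog_rows`, `distinct_blogs`) are made up for this example.

```python
import pandas as pd

# Author 1 wrote three posts on two blogs; author 2 wrote one post on one blog
toy = pd.DataFrame({
    "author_id": [1, 1, 1, 2],
    "blog_id":   [10, 10, 11, 20],
    "post_id":   [100, 101, 102, 200],
})

per_author = toy.groupby("author_id").agg(
    posts=("post_id", "count"),            # rows per author
    blog_rows=("blog_id", "count"),        # also rows per author
    distinct_blogs=("blog_id", "nunique"), # distinct blogs per author
)
print(per_author)
```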


### Top 10 authors based on the number of posts
- Author **"Eli Nathanael"** has made 42760 posts, which is 2.3% of all posts, but judging by the number of comments and likes did not receive much engagement from users
- The second-highest number of posts comes from author **FLYNN**, accounting for 1.2% of the total, with good support and engagement from users in terms of the number of comments and likes

In [None]:
#%%
# Top 10 authors based on the Number of posts
authdf["pct_total_post"] = (authdf["post_id"] / authdf["post_id"].sum()) * 100
authdf.sort_values(by = ['post_id'],ascending=False).head(10)

# Popularity based on the Engagement metrics:
To understand the popularity of an author further, we look at two key metrics: the total comments made on their posts, which shows engagement from users, and the total likes received.

**Likes**: Liking a post is easy and takes only one click. When someone likes your post, you don’t have the opportunity to engage back.

**Comments**: These allow a user to type a response to the post.

As with likes, the more comments a post receives, the more people will see the post. However, comments have more weight than likes in social media algorithms, so **five comments are worth more than five likes**. Comments are great because they give the opportunity for you to continue the conversation. When a user comments, you can like their comment and reply to it. When you comment on other pages, you stand out from the crowded likes list and many pages will reply to you.
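
One possible way to encode the idea that comments count for more than likes is a weighted score. The sketch below is only an illustration: the weights (2 for a comment, 1 for a like), the column name `engagement_score` and the two authors are assumptions made up for the example, not values derived from this data set.

```python
import pandas as pd

# Hypothetical weights: a comment is assumed to be worth twice a like
COMMENT_WEIGHT = 2.0
LIKE_WEIGHT = 1.0

# Invented per-author totals
authors = pd.DataFrame({
    "author": ["A", "B"],
    "comment_count": [5, 0],
    "like_count": [0, 8],
})

authors["engagement_score"] = (
    COMMENT_WEIGHT * authors["comment_count"]
    + LIKE_WEIGHT * authors["like_count"]
)
ranked = authors.sort_values("engagement_score", ascending=False)
print(ranked)
```

With these weights, author A (five comments) ranks above author B (eight likes); a different choice of weights could reverse the order.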

**Derived Metric:**
- Total Comments received
- **Average comments per post**: the average number of comments received per post

In [None]:
#%%
authdf["Avg_comments_per_post"] = authdf["comment_count"] / authdf["post_id"]
authdf.sort_values(by = ['Avg_comments_per_post'],ascending=False).head(10)

**Findings**

Looking at comments per post, author **tonyedwards** made 162 posts during their time on the app and **received 132k comments**, which is the **highest average at ~816 comments per post**, but did not receive any likes, which is interesting. There can be multiple reasons; to understand this better we need to check the yearly comment data to see whether the likes feature was available at that time.

In [None]:
#%%
commentdf = authdf.sort_values(by = ['Avg_comments_per_post'],ascending=False).head(20)

matplotlib.rc_file_defaults()
ax1 = sns.set_style(style=None, rc=None )

fig, ax1 = plt.subplots(figsize=(8,6))
sns.barplot(data = commentdf, x='Avg_comments_per_post', y='author', alpha=0.5)

### Popularity based on Likes

**Derived Metric:**
- Total likes received
- **Average likes per post**: the average number of likes received per post

In [None]:
#%%
authdf["avg_likes_per_post"] = authdf["like_count"] / authdf["post_id"]
authdf.sort_values(by = ['avg_likes_per_post'],ascending=False).head(10)

**Findings**

Looking at average likes per post, author **Nicole Marie** made only 1 post during their time on the app and received 241 likes, which is the **highest average at ~241 likes per post**. The post also received 78 comments, which shows high engagement from users on that one post.
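
An author with a single post can top a per-post average, so it can be useful to compare the ranking with and without a minimum number of posts. The sketch below uses invented authors, and the `MIN_POSTS` threshold of 3 is a hypothetical choice for illustration only.

```python
import pandas as pd

MIN_POSTS = 3  # hypothetical threshold, chosen only for this example

# Invented per-author totals; post_id holds the number of posts
authdf_toy = pd.DataFrame({
    "author": ["one_hit", "steady"],
    "post_id": [1, 10],
    "like_count": [241, 500],
})
authdf_toy["avg_likes_per_post"] = authdf_toy["like_count"] / authdf_toy["post_id"]

# Ranking on the raw average
top_raw = authdf_toy.sort_values("avg_likes_per_post", ascending=False).iloc[0]["author"]

# Ranking restricted to authors with at least MIN_POSTS posts
eligible = authdf_toy[authdf_toy["post_id"] >= MIN_POSTS]
top_filtered = eligible.sort_values("avg_likes_per_post", ascending=False).iloc[0]["author"]
print(top_raw, top_filtered)
```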

In [None]:
#%%
likesdf = authdf.sort_values(by = ['avg_likes_per_post'],ascending=False).head(20)

matplotlib.rc_file_defaults()
ax1 = sns.set_style(style=None, rc=None )

fig, ax1 = plt.subplots(figsize=(8,6))
sns.barplot(data = likesdf, x='avg_likes_per_post', y='author', alpha=0.5)

## Popularity based on Yearly trend

In [None]:
#%%
authdf1 =  df.groupby(["author_id",'author','Year_gmt']).agg({"post_id":'count',"blog_id":'nunique',"comment_count":'sum',"like_count":'sum'}).reset_index()
# Creating metrics to better understand engagement 
authdf1["avg_likes_per_post"] = authdf1["like_count"] / authdf1["post_id"]
authdf1["Avg_comments_per_post"] = authdf1["comment_count"] / authdf1["post_id"]
authdf1.sort_values(by = ['comment_count',"like_count"],ascending=False).head(5)


In [None]:
#%%
authdf1 = authdf1[authdf1['Year_gmt'] > '2002'].copy()
authdf2 = authdf1.sort_values(['Year_gmt','comment_count'], ascending=False).groupby('Year_gmt').head(1).copy()
authdf2
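
The "top author per year" step relies on sorting first and then taking the first row of each group. A minimal sketch on invented rows shows the pattern:

```python
import pandas as pd

# Invented yearly comment totals per author
toy = pd.DataFrame({
    "Year_gmt": ["2011", "2011", "2012", "2012"],
    "author": ["A", "B", "A", "C"],
    "comment_count": [5, 9, 7, 3],
})

# Sort so that the most-commented author comes first within each year,
# then keep the first row of every year
top_per_year = (
    toy.sort_values(["Year_gmt", "comment_count"], ascending=False)
       .groupby("Year_gmt")
       .head(1)
)
winners = dict(zip(top_per_year["Year_gmt"], top_per_year["author"]))
print(winners)
```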

In [None]:
#%%
matplotlib.rc_file_defaults()
ax1 = sns.set_style(style=None, rc=None )

fig, ax1 = plt.subplots(figsize=(8,6))
sns.barplot(data = authdf2, x='comment_count', y='author', alpha=0.5)

### Findings

Looking at total comment counts, author tonyedwards has received more than 120000 comments during their time on the app. The second-highest recipient is Ron DuBour, with around 60000 comments, which can be interpreted as high engagement for these two authors.

# What additional data would you need?

- The key missing feature in this data is the **actual text of the comments left by users**. It would have helped us perform **sentiment analysis** to understand the emotions and the reception of each post. 

**User acquisition funnel analysis**
- **First post/blog date or author signup date**: a signup or first-post date would have helped us build a retention analysis to understand the health of the user acquisition funnel
- Details about users, such as the dates of signup and of their first comment/like, are missing

**Demographic data**
- Demographic data such as age, gender and education would have helped us segment the data into multiple funnels
- Author type, e.g. poet, novelist, writer, lyricist, screenwriter, playwright
- Post/blog type, e.g. music, sport, reviews, news, updates


# What is a unique insight you can provide us about the data?

### Key Metrics yearly trend analysis:
- Total Active authors per year
- Total Posts per year
- Total Comments received
- Average comments per post per year: the average number of comments received per post
- Total Likes received
- Average likes per post per year: the average number of likes received per post

In [None]:
#%%
insdf =  df.groupby(['Year_gmt']).agg({'author_id':'count','blog_id':'count','post_id':'count','comment_count':'sum','like_count':'sum'}).reset_index()

# Creating metrics to better understand engagement (each ratio is multiplied by 100, so values read as "per 100 posts" or "per 100 author rows")
insdf["avg_likes_per_post"] = (insdf["like_count"] / insdf["post_id"]) * 100
insdf["Avg_comments_per_post"] = (insdf["comment_count"] / insdf["post_id"]) * 100
insdf["Avg_comments_per_author"] = (insdf["comment_count"] / insdf["author_id"]) * 100
insdf.sort_values(by = ['Year_gmt'],ascending=False).head(15)


**Insights:**
    
- **Yearly Active Authors + New user Analysis**: The data shows that the active author count roughly doubled every year, with a growth rate of ~200% starting from 2007 that rose to ~450% in 2011 and 2012. The count peaked in 2014 at 619470.
- **User Engagement**: The key measures of engagement are the number of comments and likes on a post, since these let users interact with authors and keep the platform alive. Engagement is therefore measured with two key metrics: average comments per post in a year and average likes per post in a year.
- Starting from 2010, the comments-per-post metric increased drastically, to about ~300. Since the metric is multiplied by 100 in the code, this likely corresponds to roughly 3 comments per post.


In [None]:
#%%
insdf1 = insdf[insdf['Year_gmt'] > '2005-01'].copy()
insdf1.sort_values('Year_gmt',inplace=True)
insdf1 = insdf1.set_index('Year_gmt')

# (column, panel title, y-axis label) for each of the six panels
panels = [
    ('author_id', "Total Yearly Author Count", "Unique Author count"),
    ('post_id', "Total Yearly blog Count", "Sum [post count]"),
    ('comment_count', "Total Yearly Comment Count", "Sum [comment count]"),
    ('like_count', "Total Yearly likes Count", "Sum [likes count]"),
    ('Avg_comments_per_post', "Avg_comments_per_post", "Avg_comments_per_post"),
    ('avg_likes_per_post', "Avg_likes_per_post", "avg_likes_per_post"),
]

fig, axes = plt.subplots(3, 2, sharex=True, figsize=(30, 10))
fig.suptitle("Key Metrics Analysis: Authors", fontsize=30)
# draw one bar chart per metric, filling the grid row by row
for ax, (col, title, ylabel) in zip(axes.flat, panels):
    insdf1[col].plot(ax=ax, kind='bar', color='cadetblue', rot=90, fontsize=12)
    ax.set_title(title, fontsize=20)
    ax.set_ylabel(ylabel, labelpad=10)
    ax.set_xlabel("Date [Year]", labelpad=10)



### Findings:

1. Total yearly author count: highest in the year 2014
2. Total yearly blog count: highest in the year 2014
3. Total yearly comment count: highest in the year 2014
4. Total yearly likes count: highest in the year 2014, 
    and there were no likes on posts prior to 2011
5. Average comments per post: significantly high since 2006, 
    with the highest values in the years 2011 and 2012
6. Average likes per post: very high from 2013 to 2015. 
    Likes on posts have become more frequent since 2010 


In [None]:
#%%
# Creating metrics to better understand engagement 
authdf1["avg_likes_per_post"] = authdf1["like_count"] / authdf1["post_id"]
authdf1["Avg_comments_per_post"] = authdf1["comment_count"] / authdf1["post_id"]
authdf1.sort_values(by = ['comment_count',"like_count"],ascending=False).head(5)


## Total Blogs by Author analysis

In [None]:
#%%
blogdf = df.groupby(["author_id",'author']).agg({"blog_id":"count"}).reset_index().sort_values("blog_id",ascending=False).head(20)

blogdf

In [None]:
#%%
matplotlib.rc_file_defaults()
ax1 = sns.set_style(style=None, rc=None )

fig, ax1 = plt.subplots(figsize=(8,6))
sns.barplot(data = blogdf, x='blog_id', y='author', alpha=0.5)

### Findings:

1. The author 'Eli Nathanael' has written the highest number of blogs
2. The author 'walter' has written the lowest number of blogs among the top 20 authors by blog count

# Key metrics by language analysis

In [None]:
#%%
langdf =  df.groupby(['lang']).agg({'author_id':'count','blog_id':'count','post_id':'count','comment_count':'sum','like_count':'sum'}).reset_index()
# Creating metrics to better understand engagement (each ratio is multiplied by 100, so values read as "per 100 posts" or "per 100 author rows")
langdf["avg_likes_per_post"] = (langdf["like_count"] / langdf["post_id"]) * 100
langdf["Avg_comments_per_post"] = (langdf["comment_count"] / langdf["post_id"]) * 100
langdf["Avg_comments_per_author"] = (langdf["comment_count"] / langdf["author_id"]) * 100
langdf["pct_total_authors"] = (langdf["author_id"]/langdf["author_id"].sum()) * 100
langdf.sort_values(by = ["comment_count"],ascending=False).head(15)


**Findings**

About **92% of authors have written blogs/posts in the "English" language**, and English posts received the highest number of comments and likes.
    

In [None]:
#%%
# keep the 10 languages with the most comments
langdf = langdf.sort_values(by = ["comment_count"],ascending=False).head(10).copy()
lang_plot = langdf.set_index('lang')

# (column, panel title, y-axis label) for each of the six panels
panels = [
    ('author_id', "Total Author Count by Language", "Unique Author count"),
    ('post_id', "Total blog Count by Language", "Sum [post count]"),
    ('comment_count', "Total Comment Count by Language", "Sum [comment count]"),
    ('like_count', "Total likes Count by Language", "Sum [likes count]"),
    ('Avg_comments_per_post', "Avg_comments_per_post", "Avg_comments_per_post"),
    ('avg_likes_per_post', "Avg_likes_per_post", "avg_likes_per_post"),
]

fig, axes = plt.subplots(3, 2, sharex=True, figsize=(30, 10))
fig.suptitle("Key Metrics Analysis by Language", fontsize=30)
# draw one bar chart per metric, filling the grid row by row
for ax, (col, title, ylabel) in zip(axes.flat, panels):
    lang_plot[col].plot(ax=ax, kind='bar', color='cadetblue', rot=90, fontsize=12)
    ax.set_title(title, fontsize=20)
    ax.set_ylabel(ylabel, labelpad=10)
    ax.set_xlabel("Language", labelpad=10)



### Findings:

1. Total author count: highest for the language 'English'
2. Total blog count: highest for the language 'English'
3. Total comment count: highest for the language 'English'
4. Total likes count: highest for the language 'English'
5. Average comments per post: surprisingly low for 'English' 
    and high for the languages 'tl' and 'fr'
6. Average likes per post: highest for 'English' and fairly evenly spread across the other languages



## Tools used for this analysis:
- json and gzip: to read the gzipped JSON Lines file so it can be written out as CSV
- os and urllib.request
- pandas and numpy: dataframes and series
- datetime: date handling
- matplotlib, seaborn and plotly.express: plotting
- warnings: to control warning output
