### In this notebook we clean and explore the news category dataset. The original dataset is large (200k records),
### so we will filter it and prepare it for our NLP implementation.
### Hopefully it will help fellow beginners get started with data cleaning.
### Let's start.

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json        # deals with json files
import seaborn as sns  #visualization library
import matplotlib.pyplot as plt
from datetime import datetime

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [4]:
with open("../input/news-category-dataset/News_Category_Dataset_v2.json",mode='r') as json_file :
    List_of_dict=[ json.loads(line) for line in json_file ]    
        
df=pd.DataFrame(List_of_dict)
df.sample(10)
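
The loop above parses the file one line at a time, which suggests it is stored as JSON Lines (one JSON object per line). Under that assumption, `pd.read_json` with `lines=True` should give the same frame in one call. This is a minimal sketch on two made-up records held in memory, not the real file; the record contents are hypothetical.

```python
import io
import json

import pandas as pd

# Two hypothetical records in JSON Lines form (one JSON object per line)
raw = (
    '{"category": "POLITICS", "headline": "A headline", "date": "2018-05-26"}\n'
    '{"category": "WELLNESS", "headline": "Another headline", "date": "2018-05-25"}\n'
)

# Manual approach, as in the cell above: parse each line separately
records = [json.loads(line) for line in io.StringIO(raw)]
df_manual = pd.DataFrame(records)

# pandas equivalent: lines=True tells read_json to expect one object per line.
# convert_dates=False keeps the date column as text, matching the manual approach.
df_pandas = pd.read_json(io.StringIO(raw), lines=True, convert_dates=False)

print(df_pandas)
```

With the real dataset, the file path would be passed in place of the `io.StringIO` object.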

In [5]:
print(df.dtypes)
# The date column needs to be in datetime format, so let's convert it now

df.date= pd.to_datetime(df.date)

In [6]:
# Straight away we can see empty entries, so let's convert them to NaN and handle them.
df.replace('',np.nan,inplace=True)

In [7]:
df.isnull().sum() 
# So short_description and authors have the most NaN values

In [8]:
# short_description is an important feature and contains NaN values, so drop the rows where it is missing
df.dropna(axis=0,how='all',subset=['short_description'],inplace=True)
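
To see why the empty strings have to be converted first, here is a small sketch on a hypothetical toy frame (the values are made up). `isnull()` does not treat `''` as missing, so the counts only become visible after the `replace` step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with empty strings standing in for missing text
toy = pd.DataFrame({
    "headline": ["a", "b", "c"],
    "short_description": ["some text", "", "more text"],
    "authors": ["", "", "someone"],
})

# Empty strings are not counted as missing until they are converted to NaN
missing_before = toy.isnull().sum()

cleaned = toy.replace("", np.nan)
missing_after = cleaned.isnull().sum()

# Drop only the rows where short_description is missing; rows with missing authors are kept
cleaned = cleaned.dropna(subset=["short_description"])

print(missing_before)
print(missing_after)
print(len(cleaned))
```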

In [9]:
fig=plt.figure(figsize=(8,8))
sns.countplot(x='category',data=df)  # pass the column by keyword; recent seaborn versions do not accept it positionally
plt.xticks(rotation=90)
fig.suptitle('Number of articles for each category')


# POLITICS and WELLNESS seem to have the largest number of articles

In [10]:
date_by_category=df.groupby(['date'])['category'].count().reset_index()

In [11]:
# Sum the per-day counts so that each bar is the number of articles published in that year
articles_per_year=date_by_category.groupby(date_by_category['date'].dt.year)['category'].sum().reset_index()
articles_per_year.columns=['year','number_of_articles']
sns.barplot(x='year',y='number_of_articles',data=articles_per_year)

# 2013-2017 had the most articles, while 2018 had the fewest
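
As a small illustration of counting articles per year, the sketch below works directly on the date column with the `.dt` accessor, so every row counts as one article. The toy frame is hypothetical and only stands in for `df`.

```python
import pandas as pd

# Hypothetical toy data: three articles in 2017 (two on the same day) and one in 2018
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-03-05", "2018-02-10"]),
    "category": ["POLITICS", "WELLNESS", "POLITICS", "FOOD & DRINK"],
})

# .dt.year pulls the year out of each timestamp; value_counts then counts rows (articles) per year
articles_per_year = toy["date"].dt.year.value_counts().sort_index()

print(articles_per_year.to_dict())  # {2017: 3, 2018: 1}
```

The same pattern with `.dt.month` gives the count per month.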

In [12]:
# Sum the per-day counts so that each bar is the number of articles published in that month
articles_per_month=date_by_category.groupby(date_by_category['date'].dt.month)['category'].sum().reset_index()
articles_per_month.columns=['month_num','number_of_articles']
sns.barplot(x='month_num',y='number_of_articles',data=articles_per_month)

# march & may had most while june had least number of articles

In [13]:
df.drop(['authors','date'],axis=1,inplace=True)

## Dropping date (it is not helpful for identifying the category of an article) and authors (it contains names and designations, which are not required)


In [14]:
df.link[52]

### * I think I saw something useful in df["link"].
### * If I can extract the keywords from a URL, it would be great for NLP.
### * For example: 'https://www.huffingtonpost.com/entry/hollywood-doesnt-need-difficult-men-to-make-great-tv_us_5b080bcce4b0568a880aa6d9'.
### * In this example, "hollywood-doesnt-need-difficult-men-to-make-great-tv" is the keyword string, which helps to represent the Entertainment category.
### Let's go.

In [15]:
df=df.reset_index(drop=True)

In [None]:
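# For each link, take the text before the first '_' and then the part after 'y/'
# (the end of '/entry/') to get the exact set of keywords.
# Links that contain 'http' two or more times are skipped, as are links that do not
# match the expected pattern; both keep an empty keyword string.

def extract_keywords(link):
    if link.count("http")>=2:
        return ''
    try:
        return link.split('_')[0].split('y/')[1]
    except IndexError:
        # Handle the error per link so that one odd URL does not stop the remaining rows
        return ''

df['keywords']=df['link'].apply(extract_keywords)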
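
The same idea can also be sketched with a regular expression and pandas string methods. This is only an illustration on two hypothetical links, and it assumes the keywords always sit between `/entry/` and the first underscore, as in the example URL above. The second link is made up to show what happens when the pattern does not match.

```python
import pandas as pd

# Hypothetical sample links in the same shape as the dataset's "link" column
links = pd.Series([
    "https://www.huffingtonpost.com/entry/hollywood-doesnt-need-difficult-men-to-make-great-tv_us_5b080bcce4b0568a880aa6d9",
    "https://www.huffingtonpost.comhttp://example.com/some-story.html",
])

# Capture the text between "/entry/" and the first underscore
slug = links.str.extract(r"/entry/([^_/]+)_", expand=False)

# Links that do not match give NaN; replace those with an empty string,
# then turn the hyphenated slug into space-separated words
keywords = slug.fillna("").str.replace("-", " ", regex=False)

print(keywords.tolist())
```

Replacing the hyphens with spaces is optional; it just makes the result easier to feed into a tokenizer later.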

### This step is optional: I want to make sure that each label in our target column (category) has an equal number of corresponding records.
### Example: Politics has 29k records but Entertainment has 13k records.
### This step removes the imbalance between categories.

In [None]:
# I am going to take 5k random records from each of 6 categories (politics, wellness, food & drink, world news, parenting, business)

# "THE WORLDPOST" and "WORLDPOST" belong to world news
df.category = df.category.map(lambda x: "WORLD NEWS" if x == "THE WORLDPOST" or x=="WORLDPOST" else x) 


df_politics=df.loc[df['category']=='POLITICS'].sample(5000)
df_wellness= df.loc[df['category']=='WELLNESS'].sample(5000)
df_food= df.loc[df['category']=='FOOD & DRINK'].sample(5000)
df_world_news=df.loc[df['category']=="WORLD NEWS"].sample(5000)
df_parenting=df.loc[df['category']=='PARENTING'].sample(5000)
df_business=df.loc[df['category']=='BUSINESS'].sample(5000,replace=True)
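
The samples above are six separate DataFrames. One possible way to draw and combine such samples into a single balanced frame is sketched below on hypothetical toy data. The names `toy`, `n_per_category` and `balanced` are made up for the illustration, and 3 rows per category stand in for the 5000 used above.

```python
import pandas as pd

# Hypothetical toy frame: three categories with unequal sizes
toy = pd.DataFrame({
    "category": ["POLITICS"] * 6 + ["WELLNESS"] * 4 + ["BUSINESS"] * 2,
    "headline": [f"headline {i}" for i in range(12)],
})

n_per_category = 3  # stands in for the 5000 used above

parts = []
for name, group in toy.groupby("category"):
    # Sample with replacement only when a category has fewer rows than requested
    needs_replacement = len(group) < n_per_category
    parts.append(group.sample(n_per_category, replace=needs_replacement, random_state=0))

# Stack the per-category samples into one frame with a fresh index
balanced = pd.concat(parts, ignore_index=True)

print(balanced["category"].value_counts())
```

Sampling with replacement repeats some rows, so a category that is too small will contain duplicates in the balanced frame.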

In [None]:
df.to_csv('final_news_df.csv',index=False) #exporting to csv file

### Thanks for tuning in. We will continue with the visualization in the next notebook. Hope it helped my fellow beginners!

#### I have also cleaned up the new CSV file a little more. Check out the dataset: [https://www.kaggle.com/setseries/news-category-dataset]