## Cleaning Data
> We will clean the data in the `news_categories.csv` file and keep only the columns we need to classify news by category.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import csv
import nltk

In [2]:
path_name = "news_categories.csv"

In [10]:
news = pd.read_csv(path_name)
news.head(10)

Unnamed: 0,category,country_code,urlToImage,author,title,description,url,publishedAt,content
0,BUSINESS,za,https://cdn.24.co.za/files/Cms/General/d/6839/...,,CEO of $2bn start-up ousted for microdosing LS...,,https://www.news24.com/fin24/economy/world/ceo...,2021-04-28T07:15:39Z,Marketing startup Iterable dismissed its chief...
1,BUSINESS,za,https://businesstech.co.za/news/wp-content/upl...,https://www.facebook.com/BusinessTechSA,Mango flight booking warning - BusinessTech,State-owned domestic airline Mango is facing f...,https://businesstech.co.za/news/business/48629...,2021-04-28T07:09:14Z,State-owned domestic airline Mango is facing f...
2,BUSINESS,za,https://businesstech.co.za/news/wp-content/upl...,https://www.facebook.com/BusinessTechSA,Absa launches QR payments - BusinessTech,Absa has launched QR payments as part of a gro...,https://businesstech.co.za/news/banking/486249...,2021-04-28T06:08:33Z,Absa has launched QR payments as part of a pus...
3,BUSINESS,za,https://cdn.24.co.za/files/Cms/General/d/8065/...,,Spotify undercuts Apple with new ways for podc...,Spotify is rolling out new ways for podcasters...,https://www.news24.com/fin24/companies/ict/spo...,2021-04-28T03:19:30Z,Spotify is rolling out new ways for podcasters...
4,BUSINESS,za,https://cdn.24.co.za/files/Cms/General/d/2925/...,,"REVIEW | Subaru Forester is a formidable SUV, ...",The Subaru Forester has been refreshed for 202...,https://www.news24.com/wheels/offroad_and_4x4/...,2021-04-28T02:31:13Z,"• Subaru's new, updated Forester is now availa..."
5,BUSINESS,za,https://www.dailymaverick.co.za/wp-content/upl...,J Brooks Spector,"Scams, cons and stupidities: History is replet...",The White Spiritual Boy Trust offers a way to ...,https://www.dailymaverick.co.za/article/2021-0...,2021-04-27T19:54:28Z,"(Photo: Adobe Stock) \r\nThe point is, ladies ..."
6,BUSINESS,za,http://cdn.24.co.za/files/Cms/General/d/11141/...,,SpaceX was approved to fly thousands more sate...,"After a near-miss earlier this month, several ...",https://www.businessinsider.com/elon-musk-spac...,2021-04-27T19:11:07Z,The American Federal Communications Commission...
7,BUSINESS,za,https://mybroadband.co.za/news/wp-content/uplo...,https://www.facebook.com/mybroadband,One in four mobile subscriptions fraudulent in...,Mobile subscription fraud continues to remain ...,https://mybroadband.co.za/news/cellular/394507...,2021-04-27T14:04:17Z,Mobile subscription fraud continues to remain ...
8,BUSINESS,za,https://businesstech.co.za/news/wp-content/upl...,https://www.facebook.com/BusinessTechSA,Why offshore investments continue to be an att...,In the current investment environment offshore...,https://businesstech.co.za/news/banking/483135...,2021-04-27T09:00:30Z,In the current investment environment offshore...
9,BUSINESS,za,https://mybroadband.co.za/news/wp-content/uplo...,https://www.facebook.com/mybroadband,Tesla sold Bitcoin to prove cryptocurrency’s l...,Elon Musk said Tesla Inc. sold 10% of its Bitc...,https://mybroadband.co.za/news/cryptocurrency/...,2021-04-27T05:26:22Z,Elon Musk said Tesla Inc. sold 10% of its Bitc...


> We are interested in the `category` and `description` columns: we will classify each news item based on its `description`, and `category` is the label.

In [12]:
categories = news.category.values
description = news.description.values

In [14]:
len(categories), len(description), description[:2], categories[:2]

(5279,
 5279,
 array([nan,
        'State-owned domestic airline Mango is facing financial uncertainty, raising questions about how much longer it can stay in the air.'],
       dtype=object),
 array(['BUSINESS', 'BUSINESS'], dtype=object))

> We want to remove every news item whose description is `nan` (missing) from our data.

In [62]:
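# Note: `news` is rebound here from the DataFrame to a plain list of [category, description] pairs.
news = []

for i in range(len(categories)):
    # pd.notna is an explicit missing-value check (instead of comparing str(...) to 'nan')
    if pd.notna(description[i]):
        news.append([categories[i], description[i]])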

In [63]:
len(news),  news[:5]

(4706,
 [['BUSINESS',
   'State-owned domestic airline Mango is facing financial uncertainty, raising questions about how much longer it can stay in the air.'],
  ['BUSINESS',
   'Absa has launched QR payments as part of a growing push to reduce physical transactions.'],
  ['BUSINESS',
   'Spotify is rolling out new ways for podcasters to make money from their shows, stepping up efforts after a recent move by Apple to attract talent to its platform.'],
  ['BUSINESS',
   'The Subaru Forester has been refreshed for 2021. We drive the version with the small(er) 2.0-litre petrol engine.'],
  ['BUSINESS',
   'The White Spiritual Boy Trust offers a way to think about scams and cons — and to realise this current imbroglio has lots of historical antecedents.']])
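
> The same filtering can also be done in pandas directly. The snippet below is a minimal sketch, assuming a small made-up DataFrame (`sample`) that stands in for the real file; `DataFrame.dropna(subset=...)` drops the rows with a missing description, and the result is converted to the same list-of-pairs shape used by the loop above.

```python
import pandas as pd

# Hypothetical miniature of the news table; only the two columns used here.
sample = pd.DataFrame({
    "category": ["BUSINESS", "BUSINESS", "SPORTS"],
    "description": [None, "Absa has launched QR payments.", "A late goal decided the match."],
})

# Drop rows whose description is missing, then keep the two columns of interest.
kept = sample.dropna(subset=["description"])[["category", "description"]]

# Convert to a list of [category, description] pairs.
pairs = kept.values.tolist()
print(len(pairs), pairs[0])
```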

> We are now left with `4706` news items. Next, let's remove what else we do not need: numbers and punctuation. We will also convert every sentence to lowercase.

In [64]:
import re

In [82]:
news_lower_clean = []
for cat, desc in news:
    # lowercase, then replace every non-word character (punctuation, symbols) with a space
    new__1 = re.sub(r"\W", " ", desc.lower())
    # remove digits
    new__2 = re.sub(r"\d", "", new__1)
    news_lower_clean.append([cat, new__2])

In [83]:
len(news_lower_clean),  news_lower_clean[:5]

(4706,
 [['BUSINESS',
   'state owned domestic airline mango is facing financial uncertainty  raising questions about how much longer it can stay in the air '],
  ['BUSINESS',
   'absa has launched qr payments as part of a growing push to reduce physical transactions '],
  ['BUSINESS',
   'spotify is rolling out new ways for podcasters to make money from their shows  stepping up efforts after a recent move by apple to attract talent to its platform '],
  ['BUSINESS',
   'the subaru forester has been refreshed for   we drive the version with the small er    litre petrol engine '],
  ['BUSINESS',
   'the white spiritual boy trust offers a way to think about scams and cons   and to realise this current imbroglio has lots of historical antecedents ']])
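
> The output above still contains runs of spaces where punctuation and digits used to be, and `\W` leaves underscores untouched. The sketch below shows one possible way to handle both; the helper name `clean_text` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def clean_text(text):
    """Lowercase, strip punctuation/underscores and digits, collapse whitespace."""
    lowered = text.lower()
    # \W does not match "_", so the underscore is listed explicitly
    no_punct = re.sub(r"[\W_]", " ", lowered)
    # remove digits, as in the cell above
    no_digits = re.sub(r"\d", "", no_punct)
    # collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", no_digits).strip()

example = ("The Subaru Forester has been refreshed for 2021. "
           "We drive the version with the small(er) 2.0-litre petrol engine.")
cleaned = clean_text(example)
print(cleaned)
```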

> Remove `stopwords` from each news description.

In [100]:
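from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# With no argument, stopwords.words() returns the stopwords of every language in the corpus.
# A set makes the `word not in stop_words` check much faster than a list.
stop_words = set(stopwords.words())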

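> Calling `stopwords.words()` without a language name returns the `stopwords` of every language in the corpus, so we remove stopwords for all languages, not only English.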

In [116]:
news_lower_clean_stopwords = []
for cat, desc in news_lower_clean:
    # tokenize the description and drop every token that is a stopword
    word_tokens = [word for word in word_tokenize(desc) if word not in stop_words]
    new__1 = ' '.join(word_tokens)

    news_lower_clean_stopwords.append([cat, new__1])


In [117]:
len(news_lower_clean_stopwords),  news_lower_clean_stopwords[:5]

(4706,
 [['BUSINESS',
   'state owned domestic airline mango facing financial uncertainty raising questions much longer stay air'],
  ['BUSINESS',
   'absa launched qr payments part growing push reduce physical transactions'],
  ['BUSINESS',
   'spotify rolling new ways podcasters make money shows stepping efforts recent move apple attract talent platform'],
  ['BUSINESS',
   'subaru forester refreshed drive version small litre petrol engine'],
  ['BUSINESS',
   'white spiritual boy trust offers way think scams cons realise current imbroglio lots historical antecedents']])
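
> To make the filtering step easy to try without downloading any NLTK data, here is a minimal sketch of the same idea. It assumes a tiny hand-written stop list (`DEMO_STOP_WORDS`, which is not NLTK's list) and uses `str.split` instead of `word_tokenize`, which is a simplification that only suits text that has already had its punctuation removed.

```python
# Tiny hand-written stop list, for illustration only (NOT the NLTK list).
DEMO_STOP_WORDS = frozenset(
    {"is", "it", "the", "in", "how", "about", "can", "a", "as", "of", "to", "has"}
)

def remove_stopwords(text, stop_words=DEMO_STOP_WORDS):
    """Split on whitespace and keep only tokens that are not stopwords."""
    tokens = text.split()
    return " ".join(token for token in tokens if token not in stop_words)

sample = ("absa has launched qr payments as part of a growing push "
          "to reduce physical transactions ")
filtered = remove_stopwords(sample)
print(filtered)
```

> Membership tests on a set (or frozenset) take roughly constant time, which matters when every token of several thousand descriptions is checked against a large multi-language stop list.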

> Our data is now cleaner. As a next step we can `lemmatize` the words, that is, reduce each word to its dictionary form, using the `WordNetLemmatizer`. Correcting spellings is another possible step, but the lemmatizer does not do that.

In [118]:
from nltk.stem import WordNetLemmatizer

In [119]:
lemmatizer = WordNetLemmatizer()

In [None]:
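cleaned_news = []
for cat, desc in news_lower_clean:
    # drop stopwords, then lemmatize each remaining token
    # (lemmatize() treats words as nouns by default; it needs the NLTK `wordnet` corpus)
    word_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(desc) if word not in stop_words]
    new__1 = ' '.join(word_tokens)

    cleaned_news.append([cat, new__1])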
