# BBC News Classification Project

A large and growing amount of information is stored in electronic form. Making use of it requires tools that can interpret and analyse the data and extract facts that support decision-making.<br>
News is easily accessible through content providers such as online news services. Large volumes of text exist across many domains, and analysing them can be valuable in several areas. Classification is a challenging task in text mining because it requires preprocessing steps that convert unstructured text into structured information. As the number of news articles grows, it becomes harder for readers to find the news that interests them, so internet news needs to be divided into categories. Categorization here means grouping articles so that readers can navigate among them more easily.
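
To make the preprocessing idea concrete, here is a minimal sketch in plain Python of turning raw text into a structured bag-of-words representation. The stop-word list and the helper name `bag_of_words` are assumptions chosen for illustration; they are not part of this project's pipeline.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use a much larger one)
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on"}

def bag_of_words(text):
    """Lowercase, tokenize on letters, drop stop words, and count the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

headline = "The price of oil rises as the oil market reacts to the news"
features = bag_of_words(headline)
print(features.most_common(3))
```

The resulting counts are the kind of structured input a classifier can work with, whereas the raw headline is not.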

### The Dataset
Text documents are one of the richest sources of data for businesses.

We’ll use a public dataset from the BBC comprised of 2225 articles, each labeled under one of 5 categories: business, entertainment, politics, sport or tech.

The dataset is broken into 1490 records for training and 735 for testing. The goal will be to build a system that can accurately classify previously unseen news articles into the right category.
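
The text does not say how the 1490/735 split was produced. As a sketch under that caveat, a reproducible random split of article indices could look like the following; the helper `split_indices` and the seed value are hypothetical.

```python
import random

N_ARTICLES = 2225   # total articles, from the text above
N_TRAIN = 1490      # training records, from the text above

def split_indices(n_total, n_train, seed=42):
    """Shuffle article indices reproducibly and cut them into train and test lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(N_ARTICLES, N_TRAIN)
print(len(train_idx), len(test_idx))          # 1490 735
print(round(len(train_idx) / N_ARTICLES, 2))  # 0.67
```

With these sizes, roughly two thirds of the articles are used for training and one third is held out for testing.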


1. **Summary of the project**: We classify BBC news articles into five categories using Natural Language Processing and Machine Learning. The five news topics are Politics, Entertainment, Sports, Technology, and Business. The goal of this project is to build a text classifier that streamlines the categorization of news publications.
<br><br>
2. **Summary of the data**: The dataset consists of 2225 news articles extracted from the BBC website between 2004 and 2005. It was published as an open resource by Insight Resources and was collected at University College Dublin (UCD) for research. The class distribution is as follows:
```
business: 510
entertainment: 386
politics: 417 
sports: 511
technology: 401 
```
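
As a quick check on the figures above, the short sketch below sums the class counts and computes each class's share. It also reports the accuracy a trivial classifier would reach by always predicting the largest class, which is a useful floor when judging a trained model. The variable names are chosen here for illustration.

```python
# Class counts copied from the distribution listed above
class_counts = {
    "business": 510,
    "entertainment": 386,
    "politics": 417,
    "sports": 511,
    "technology": 401,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}
majority_label = max(class_counts, key=class_counts.get)

print(total)  # 2225
for label, share in shares.items():
    print(f"{label:<14}{share:.1%}")
print("majority-class baseline accuracy:", round(shares[majority_label], 3))
```

The classes are fairly balanced, so plain accuracy is a reasonable first metric for this dataset.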

### PySpark – Overview
Apache Spark is written in the Scala programming language. To support Python, the Apache Spark community released PySpark, a Python API for Spark. With PySpark you can work with RDDs (Resilient Distributed Datasets) from Python. This is possible thanks to a library called Py4J, which lets Python code call into objects running in the JVM.

The PySpark shell links the Python API to the Spark core and initializes the Spark context. Many data scientists and analysts use Python because of its rich set of libraries, so integrating Python with Spark is a real benefit for them.

#### PySpark - SparkContext
![image.png](attachment:image.png)

#### Install pyspark

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=2c6b2e8def6e85cefc86f5fdda9cf2ec86ddb272fd614868abc7dc7c8293ae7f
  Stored in directory: /root/.cache/pip/wheels/40/1b/2c/30f43be2627857ab80062bef1527c0128f7b4070b6b2d02139
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Standard library
import os
import re
import csv
import glob

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# PySpark
from pyspark import SparkContext, SparkFiles
from pyspark.ml import Pipeline
from pyspark.ml.feature import *
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.mllib.classification import *

# Reuse the running SparkContext if one exists, otherwise start one
sc = SparkContext.getOrCreate()
# SQLContext is kept for backward compatibility; SparkSession is the preferred entry point since Spark 2.0
sqlCtx = SQLContext(sc)

In [None]:
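#Load the dataset
def read_file(main_df, category):
  # Read every article file of one category and append it to main_df as (text, label) rows
  rows = []
  pattern = '/content/drive/My Drive/AI_Project/NLP/BBC-News-Classification/Data/' + category + '/*'
  for filename in sorted(glob.glob(pattern)):
    # errors='replace' is an assumption so that a stray non-UTF-8 byte does not abort the load
    with open(filename, encoding='utf-8', errors='replace') as f:
      # Keep only non-blank lines and join them into a single text field
      lines = [line for line in f.read().splitlines() if line.strip()]
    rows.append({'text': '\n'.join(lines), 'label': category})
  # Concatenate once per category instead of once per file
  new_df = pd.DataFrame(rows, columns=['text', 'label'])
  return pd.concat([main_df, new_df], ignore_index=True)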
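
The loader above expects one sub-folder per category with one article per file. The following sketch reproduces that layout and the per-file step (read the file, drop blank lines, join the rest) using only the standard library and a temporary directory. The file name `001.txt`, the sample text, and the helper `load_category` are hypothetical and exist only for this illustration; the sketch does not touch the Drive path used in the notebook.

```python
import glob
import os
import tempfile

def load_category(data_dir, category):
    """Return (text, label) pairs for every file under data_dir/category."""
    records = []
    for filename in sorted(glob.glob(os.path.join(data_dir, category, "*"))):
        with open(filename, encoding="utf-8") as handle:
            # Keep only non-blank lines so each article becomes one compact text field
            lines = [line for line in handle.read().splitlines() if line.strip()]
        records.append(("\n".join(lines), category))
    return records

with tempfile.TemporaryDirectory() as data_dir:
    # Build a miniature copy of the expected layout: <data_dir>/<category>/<file>
    os.makedirs(os.path.join(data_dir, "sport"))
    with open(os.path.join(data_dir, "sport", "001.txt"), "w", encoding="utf-8") as handle:
        handle.write("Headline\n\nFirst paragraph.\n\nSecond paragraph.\n")
    records = load_category(data_dir, "sport")

print(records)
```

Running a loader against a tiny synthetic folder like this is a cheap way to confirm the text and label columns look right before pointing it at the full dataset.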

In [None]:
#Load every category and attach its label
df_news = pd.DataFrame(columns = ['text', 'label'])
categories = ['business', 'politics', 'entertainment', 'sport', 'tech']
for genre in categories:
  df_news = read_file(df_news, genre)
df_news.head()

