<a href="https://colab.research.google.com/github/Jaseko1989/ClassificationProject/blob/main/Data%20preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Overview**
The project aims to build classification models in Python and deploy them as a Streamlit web application for a news outlet.

# **Workflow**

*  Data loading
*  Preprocessing
*  Model training
*  Evaluation
*  Final deployment








# **Targeted stakeholders**
*  Editorial team
*  IT/tech support
*  Management
*  Readers





# **End Results**
Improved content categorization, operational efficiency, and enhanced user experience.

# **Dataset**
The dataset (both train.csv and test.csv) comprises news articles to be classified by content into five categories: Business, Technology, Sports, Education, and Entertainment.

# **Dataset Features**


*   **Headlines:** The headline or title of the news article.
*   **Description:** A brief summary or description of the news article.
*   **Content:** The full text content of the news article.
*   **URL:** The URL link to the original source of the news article.
*   **Category:** The category or topic of the news article (e.g., Business, Education, Entertainment, Sports, Technology).
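
The column list above can be checked programmatically before any preprocessing. The snippet below is a minimal sketch on a small hand-made DataFrame; the helper name `missing_columns` and the sample row are illustrative assumptions, not part of the project's code.

```python
import pandas as pd

# Column names as they appear in the train.csv header shown in this notebook.
EXPECTED_COLUMNS = ["headlines", "description", "content", "url", "category"]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical one-row sample standing in for train.csv.
sample = pd.DataFrame(
    {
        "headlines": ["Example headline"],
        "description": ["Example description"],
        "content": ["Example body text"],
        "url": ["https://example.com/article"],
        "category": ["business"],
    }
)

print(missing_columns(sample))                        # nothing is missing
print(missing_columns(sample.drop(columns=["url"])))  # reports the dropped column
```

Running such a check right after `pd.read_csv` would surface a renamed or missing column early, before later cells fail with a less obvious error.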


In [3]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


IMPORTING DATASET

In [4]:
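# Load the training data: features are all columns except the last, labels are the last column
dataframe = pd.read_csv('train.csv')
x = dataframe.iloc[:, :-1].values
y = dataframe.iloc[:, -1].values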

In [5]:
print(x)

[['RBI revises definition of politically-exposed persons for KYC purpose'
  'The central bank has also asked chairpersons and chief executives of banks and other financial services, which are into lending business, to make the changes effective immediately.'
  'The Reserve Bank of India (RBI) has changed the definition of Politically-Exposed Persons (PEPs) under its norms, a move that will make it easier for those individuals to carry out various banking transactions, including availing loans.\r\nCertain changes have been made in the RBI’s Know Your Customer (KYC) norms.\r\nThe earlier norms pertaining to PEPs were open-ended and there was a lack of clarity on the definition, apparently leading to issues for bankers, parliamentarians and others. There were also concerns in certain quarters that PEPs were finding it difficult to get loans or open bank accounts.\r\nADVERTISEMENT\r\nIn the amended KYC master direction, the central bank defines PEPs as “individuals who are or have been ent

In [6]:
print(y)

['business' 'business' 'business' ... 'technology' 'technology'
 'technology']
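
`iloc[:, :-1]` keeps every column except the last, so `x` also contains `url`. In the rows shown in this notebook the URL path includes the category name (for example `/article/business/`), which could leak the label into the features. Below is a minimal sketch of selecting features by name instead. It assumes the three text columns are the intended inputs; that choice, the toy data, and the variable names are all illustrative.

```python
import pandas as pd

# Toy rows standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "headlines": ["Rates held steady", "New phone announced"],
        "description": ["Central bank update", "Launch event recap"],
        "content": ["Full business text", "Full technology text"],
        "url": [
            "https://example.com/article/business/a",
            "https://example.com/article/technology/b",
        ],
        "category": ["business", "technology"],
    }
)

# Assumed feature set: text columns only, leaving out `url`.
feature_columns = ["headlines", "description", "content"]

x_text = toy[feature_columns].values  # 2-D array, one row per article
y_labels = toy["category"].values     # 1-D array of category labels

print(x_text.shape)
print(y_labels)
```

Selecting by name also keeps the code correct if the column order in the CSV changes.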


In [7]:
# Make a copy of the dataframe to retain its original features
df=dataframe.copy()
df.head()

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business


HANDLING MISSING VALUES

In [8]:
df.isna()

Unnamed: 0,headlines,description,content,url,category
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
5515,False,False,False,False,False
5516,False,False,False,False,False
5517,False,False,False,False,False
5518,False,False,False,False,False


In [9]:
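# Note: count() reports rows per column, not missing values; df.isna().sum() counts the missing entries
df.isna().count()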

headlines      5520
description    5520
content        5520
url            5520
category       5520
dtype: int64
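
`count()` on the boolean frame returned by `isna()` counts non-null entries, and since `True`/`False` are never null it simply reports the number of rows. Summing the booleans gives the number of missing values per column. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing description and one missing category.
toy = pd.DataFrame(
    {
        "headlines": ["a", "b", "c"],
        "description": ["d1", np.nan, "d3"],
        "category": ["sports", "business", None],
    }
)

row_counts = toy.isna().count()    # number of rows per column
missing_counts = toy.isna().sum()  # number of missing values per column

print(row_counts)
print(missing_counts)
```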

In [10]:
# Drop rows with null values; dropna() returns a new DataFrame, so assign the result
df = df.dropna()
df

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business
...,...,...,...,...,...
5515,"Samsung sends out invites for ‘Unpacked 2024’,...",Samsung is most likely to announce next-genera...,Samsung plans to reveal the next-generation fl...,https://indianexpress.com/article/technology/t...,technology
5516,Google Pixel 8 Pro accidentally appears on off...,The Pixel 8 Pro will most likely carry over it...,Google once again accidentally gave us a glimp...,https://indianexpress.com/article/technology/m...,technology
5517,Amazon ad on Google Search redirects users to ...,Clicking on the real looking Amazon ad will op...,A new scam seems to be making rounds on the in...,https://indianexpress.com/article/technology/t...,technology
5518,"Elon Musk’s X, previously Twitter, now worth l...","Elon Musk's X, formerly Twitter, has lost more...",More than a year after Elon Musk acquired Twit...,https://indianexpress.com/article/technology/s...,technology
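
`dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned (or `inplace=True` is passed). The sketch below uses made-up data and also resets the index so row labels stay contiguous after rows are removed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "headlines": ["a", "b", "c"],
        "content": ["text a", np.nan, "text c"],
    }
)

toy.dropna()            # result discarded: `toy` is unchanged
rows_before = len(toy)

# Assign the result to keep the cleaned frame; drop=True discards the old index.
cleaned = toy.dropna().reset_index(drop=True)

print(rows_before, len(cleaned))
print(cleaned.index.tolist())
```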


SPLITTING DATASET INTO TRAINING AND TEST SETS

In [11]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [12]:
print(x_train)

[['What keeps Sam Altman up at night? OpenAI CEO reveals his darkest fears about AI'
  'At a recent event, Sam Altman shared his unique perspective on a question that has kept many wondering for days.'
  'The CEO of OpenAI, who had a rather turbulent year, recently opened up about his deepest fears. And, this has nothing to do with the boardroom drama that unfolded last month. Sam Altman recently attended the Hope Global Forums 2023. The young CEO was a guest speaker at the event that witnessed the participation of an array of luminaries from diverse backgrounds.\r\nDuring the session ‘The Future of AI’, Altman was asked about what kept him up at night. The 37-year-old went on to describe the sci-fi stories that he watched or read while growing up. According to him, these were “really compelling stories” such as the mind virus that got in his brain, or AI’s going rogue. Describing these thoughts, Altman said that there was “something about them that really resonates with us.”\r\nThis p

In [13]:
print(x_test)

[['Babar Azam, Shahid Afridi moved to safety after explosion near stadium in Quetta'
  'An exhibition match of the Pakistan Super League (PSL) at the Nawab Akbar Bugti Stadium was halted for some time following an explosion in the Police Lines area, which left five people injured.'
  'Top Pakistani cricketers, including captain Babar Azam and Shahid Afridi, among others, were taken to the safety of the dressing room after a terror attack a few miles down the road where they were playing on Sunday.\r\nAn exhibition match of the Pakistan Super League (PSL) at the Nawab Akbar Bugti Stadium was halted for some time following an explosion in the Police Lines area, which left five people injured.\r\nA senior police officer said that rescue work had been completed at the site and the injured had been taken to hospital.\r\nThe outlawed Tehreek-e-Taliban Pakistan (TTP) claimed responsibility for the attack in a statement on Sunday. It stated that the security officials were targeted in the blas

In [14]:
print(y_train)

['technology' 'entertainment' 'technology' ... 'technology' 'technology'
 'business']


In [15]:
print(y_test)

['sports' 'education' 'entertainment' ... 'education' 'business'
 'technology']
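
The `train_test_split` call above shuffles rows at random, so category proportions in the training and test sets can drift from those of the full dataset. The function accepts a `stratify` argument that keeps the proportions approximately equal. A minimal sketch on synthetic labels (the data and variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 5 balanced classes with 10 samples each.
labels = np.repeat(
    ["business", "education", "entertainment", "sports", "technology"], 10
)
features = np.arange(len(labels)).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Count how many test samples each class received.
classes, test_counts = np.unique(l_test, return_counts=True)
print(dict(zip(classes, test_counts)))
```

Stratification matters most when categories are imbalanced, since a purely random split can leave a small class underrepresented in the test set.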


In [16]:
df.describe()

Unnamed: 0,headlines,description,content,url,category
count,5520,5520,5520,5520,5520
unique,5512,5508,5513,5514,5
top,International Education Day 2024: Know why it ...,The university has removed the requirements of...,Grand Slam fever grips tennis fans all over th...,https://indianexpress.com/article/education/kc...,education
freq,2,2,5,2,1520
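
In the summary above, `unique` is lower than `count` for every text column (for example, 5513 unique `content` values out of 5520 rows), which indicates repeated entries. Below is a sketch of locating and removing such duplicates with pandas, using made-up rows. Whether duplicates should be dropped in this project is an assumption.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "headlines": ["h1", "h2", "h3", "h4"],
        "content": ["same text", "other text", "same text", "third text"],
        "category": ["sports", "business", "sports", "education"],
    }
)

# Mark every row whose `content` already appeared in an earlier row.
is_repeat = toy.duplicated(subset=["content"], keep="first")
print(is_repeat.sum())

# Keep the first occurrence of each distinct `content` value.
deduplicated = toy.drop_duplicates(subset=["content"], keep="first").reset_index(drop=True)
print(deduplicated["headlines"].tolist())
```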


FEATURE SCALING