<a href="https://colab.research.google.com/github/Jaseko1989/ClassificationProject/blob/main/Data%20Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Overview**
The project aims to build news-article classification models in Python and deploy them as a Streamlit web application for a news outlet.

# **Workflow**

*  Data loading
*  Preprocessing
*  Model training
*  Evaluation
*  Final deployment








# **Targeted stakeholders**
*  Editorial team
*  IT/tech support
*  Management
*  Readers





# **End Results**
Improved content categorization, operational efficiency, and enhanced user experience.

# **Dataset**
The dataset (both train.csv and test.csv) comprises news articles to be classified by content into five categories: Business, Technology, Sports, Education, and Entertainment.

# **Dataset Features**


*   **Headlines:** The headline or title of the news article.
*   **Description:** A brief summary or description of the news article.
*   **Content:** The full text content of the news article.
*   **URL:** The URL link to the original source of the news article.
*   **Category:** The category or topic of the news article (e.g., Business, Education, Entertainment, Sports, Technology).


In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


IMPORTING DATASET

In [3]:
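# Load the training data: features are all columns except the last, the label is the last column
dataframe = pd.read_csv('train.csv')
x = dataframe.iloc[:, :-1].values
y = dataframe.iloc[:, -1].values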

In [4]:
print(x)

[['RBI revises definition of politically-exposed persons for KYC purpose'
  'The central bank has also asked chairpersons and chief executives of banks and other financial services, which are into lending business, to make the changes effective immediately.'
  'The Reserve Bank of India (RBI) has changed the definition of Politically-Exposed Persons (PEPs) under its norms, a move that will make it easier for those individuals to carry out various banking transactions, including availing loans.\r\nCertain changes have been made in the RBI’s Know Your Customer (KYC) norms.\r\nThe earlier norms pertaining to PEPs were open-ended and there was a lack of clarity on the definition, apparently leading to issues for bankers, parliamentarians and others. There were also concerns in certain quarters that PEPs were finding it difficult to get loans or open bank accounts.\r\nADVERTISEMENT\r\nIn the amended KYC master direction, the central bank defines PEPs as “individuals who are or have been ent

In [5]:
print(y)

['business' 'business' 'business' ... 'technology' 'technology'
 'technology']


In [6]:
# Make a copy of the dataframe so the original stays unchanged
df = dataframe.copy()
df.head()

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business


HANDLING MISSING VALUES

In [7]:
df.isna()

Unnamed: 0,headlines,description,content,url,category
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
5515,False,False,False,False,False
5516,False,False,False,False,False
5517,False,False,False,False,False
5518,False,False,False,False,False


In [9]:
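# count() on the boolean mask gives rows per column, not missing values; df.isna().sum() counts the missing ones
df.isna().count()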

headlines      5520
description    5520
content        5520
url            5520
category       5520
dtype: int64
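
To count missing values per column rather than rows, `isna().sum()` can be used. The following is a minimal sketch on a small made-up DataFrame; the `toy` data and its values are assumptions for illustration, not the project's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "headlines": ["A", "B", None],
    "content": ["text a", np.nan, "text c"],
    "category": ["business", "sports", "technology"],
})

# count() on the boolean mask: number of rows per column, missing or not
row_counts = toy.isna().count()

# sum() on the boolean mask: number of missing values per column
missing_counts = toy.isna().sum()

print(row_counts)
print(missing_counts)
```

If every entry of `missing_counts` is zero, a later `dropna()` call leaves the data unchanged.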

In [10]:
# Drop rows with null values; dropna() returns a new DataFrame, so reassign it
df = df.dropna()
df

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business
...,...,...,...,...,...
5515,"Samsung sends out invites for ‘Unpacked 2024’,...",Samsung is most likely to announce next-genera...,Samsung plans to reveal the next-generation fl...,https://indianexpress.com/article/technology/t...,technology
5516,Google Pixel 8 Pro accidentally appears on off...,The Pixel 8 Pro will most likely carry over it...,Google once again accidentally gave us a glimp...,https://indianexpress.com/article/technology/m...,technology
5517,Amazon ad on Google Search redirects users to ...,Clicking on the real looking Amazon ad will op...,A new scam seems to be making rounds on the in...,https://indianexpress.com/article/technology/t...,technology
5518,"Elon Musk’s X, previously Twitter, now worth l...","Elon Musk's X, formerly Twitter, has lost more...",More than a year after Elon Musk acquired Twit...,https://indianexpress.com/article/technology/s...,technology


# Data Preprocessing
Preprocess the data by cleaning text, removing stop words, and transforming text data into numerical features using TF-IDF or Count Vectorizer.

In [2]:
#prepare the data
x=df['content']
y=df['category']

In [5]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
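
One optional variation on the split, not used in the cell above, is to pass `stratify` so each category keeps the same share in both subsets. The sketch below uses made-up data; the `toy_x` and `toy_y` names are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 articles, two balanced categories
toy_x = [f"article {i}" for i in range(10)]
toy_y = ["business"] * 5 + ["sports"] * 5

# stratify=toy_y keeps the class proportions equal in train and test
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(len(tx_train), len(tx_test))
print(sorted(ty_test))
```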

In [26]:
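# Feature scaling (works on numeric features, so the raw text must be vectorized before it is scaled)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()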
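
Because `StandardScaler` expects numeric input, one possible ordering is to vectorize the text first and then scale the result. The sketch below is an assumption about how the steps could fit together, not the notebook's final method. The sample sentences are made up, and `with_mean=False` is used because centering is not supported for sparse matrices.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up training and test texts for illustration
train_texts = ["rbi changes kyc norms", "samsung reveals new phone", "banks raise loan rates"]
test_texts = ["google reveals new phone"]

# Fit the vectorizer on training text only, then reuse it on the test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Scale without centering so the matrices stay sparse
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Since `TfidfVectorizer` already L2-normalizes each row by default, the extra scaling step is often optional for text features.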



